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Preface 


The availability of many-core computing platforms enables a wide variety 
of technical solutions for systems across the embedded, high-performance 
and cloud computing domains. However, large scale many-core systems are 
notoriously hard to optimise. Choices regarding resource allocation alone can 
account for wide variability in timeliness and energy dissipation (up to several 
orders of magnitude). This book covers dynamic resource allocation heuristics 
for many-core systems, aiming to provide appropriate guarantees on perfor- 
mance and energy efficiency. It addresses different types of systems, aiming to 
harmonise the approaches to dynamic allocation across the complete spectrum 
between systems with little flexibility and strict performance guarantees all 
the way to highly flexible systems with soft performance guarantees. 

Resource allocation is one of the most complex problems in large multi- 
processor and distributed systems, and in general it is considered NP-hard. The 
theoretical evidence shows that the number of possible allocations of applica- 
tion tasks grows exponentially with the increase of the number of processing 
cores. The empirical evidence points in the same direction, with case studies 
showing that for a realistic multiprocessor embedded system (40-60 applica- 
tion components, 15-30 processing cores) a well-tuned search algorithm had 
to statically evaluate hundreds of thousands of distinct allocations before it 
finds one that meets the systems performance requirements. 

In this book, we argue that the only way to cope with such complexity 
is to design systems that are capable to explore the allocation space during 
runtime. This is commonly done in cloud and high-performance computing, 
mainly because the workload of such systems cannot be accurately predicted 
in advance and static allocations are thus impossible. In embedded systems, 
the workload is more predictable in terms of its worst-case behaviour, but 
static allocations that take such characterisation into account tend to produce 
underutilised platforms. We therefore set the scene for dynamic resource 
allocation mechanisms by identifying and evaluating allocation heuristics that 
can be used to provide different levels of performance guarantees, and that 
cope with different levels of dynamism on the application workload. 


xi 


xii Preface 


The book starts with a description of the common practices and challenges 
in dynamic resource allocation, highlighting the peculiarities of each domain: 
embedded, HPC and cloud computing. Then, each of the challenges is 
addressed in detail within the following chapters, which are largely self- 
contained and therefore can be read in any order. To facilitate understanding, 
all of them follow the same structure: a specific challenge is motivated and 
the respective problem is precisely formulated; a detailed description of a 
solution to the problem is then given, followed by experimental work showing 
quantitative evidence of the strengths and weaknesses of that solution; related 
work is reviewed; and a summary of the chapter is given at the end. 

The technical work that resulted in this book was done within the frame of 
the DreamCloud project, and the project website! makes available a number 
of reference implementations of the models and heuristics described here. 
Updates to this book will also be made available on that website. 


Leandro Soares Indrusiak, 
Piotr Dziurzanski, 


and Amit Kumar Singh, 


York, summer of 2016. 


“http://www.dreamcloud-project.org 
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1 


Introduction 


The availability of highly parallel computing platforms based on multi 
and manycore processors enables a wide variety of technical solutions for 
systems across the embedded and high-performance computing domains. 
However, large scale manycore systems are notoriously hard to design and 
manage, and choices regarding resource allocation alone can account for 
wide variability in timeliness and energy dissipation, up to several orders 
of magnitude. For example, the allocation of many computation-centric jobs 
to the same processing core, or communication-intensive jobs to cores linked 
by alow bandwidth interconnect, can significantly impair system performance 
specially in applications with many dependencies between jobs. 

Techniques to allocate computation and communication workloads onto 
processor platforms have been studied since the early days of computing. 
However, this problem has become significantly harder because of scale 
and dynamicity: compute platforms now integrate hundreds to thousands 
of processing cores, running complex and dynamic applications that make it 
difficult foresee the amount of load they can impose to those platforms. 

Elementary combinatorics provides us with evidence of the problem of 
scale. For a simple formulation of the problem of allocating jobs to processors 
(one-to-one allocation), one can see that the number of allocations grows with 
the factorial of the number of jobs and processors. For example, a system with 
4 jobs and 4 processing cores can have P(4, 4) = 24 possible allocations, but 
simply by doubling the number of jobs and cores the number of allocations 
becomes P(8, 8) = 40320 (where P(n, k) denotes the k-permutations of n). 
The empirical evidence points in the same direction, as it can be seen in [110] 
that for realistic manycore embedded systems (40-60 jobs, 15-30 processing 
cores) a well-tuned search algorithm had to statically evaluate hundreds of 
thousands of distinct allocations before it finds one that meets the systems 
performance requirements. 

To cope with dynamicity, a dynamic approach to resource management is 
the most obvious choice, aiming to dynamically learn and react to changes to 
the load characteristics and to the underlying compute platform. The baseline, 
which is a static allocation decided before deployment based on the (nearly) 
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complete knowledge about the load and the platform, is no longer viable. For 
example, static resource allocation in high-performance computing (HPC) has 
often been referred as a significant cause of low utilization of servers, which 
results in cost increases on hardware and energy [17]. Static allocation is 
also commonly used by aerospace and automotive industries to provide worst 
case performance guarantees that are required by certification authorities. 
However, it is well known that such an approach usually leads to under-utilised 
computing and communication resources at run-time [113]. 

The problems of scale and dynamicity are also made harder with the 
increasing density of computing and communication resources. The definition 
of density used here is not necessarily spatial, but rather on connectivity (i.e., 
dense graph). In densely connected systems, a resource allocation algorithm 
may have to make decisions very often due to the system dynamics, and may 
have to consider dozens or hundreds of potential allocation possibilities at each 
decision point (i.e., which processor should execute each job, which communi- 
cation links should be used when those jobs exchange data). Furthermore, such 
algorithms have to work in a distributed way due to the difficulty to obtain the 
up-to-date state of the whole system. And despite such levels of complexity, the 
algorithms themselves are also subject to tight constraints in performance and 
energy. It is then evident that optimal resource allocation algorithms cannot 
cope with this type of problem, and that lightweight heuristic solutions are 
needed. 

This book is therefore concerned with the kinds of resource allocation 
heuristics that can cover different levels of dynamicity, while coping with the 
scale and complexity of high-density manycore platforms. 


1.1 Application Domains 


The level of dynamicity of a system denotes how often it changes its 
characteristics. In this book, we are concerned with resource allocation, so 
dynamicity means how much variation can be found on the system workload 
(e.g., arrival patterns, computation and communication requirements, value 
to the end-user) and on the underlying compute platform (e.g., degradation 
or lost of performance due to faults, increase in capacity due to upgrades). 
Different application domains can be characterised by their typical levels of 
dynamicity. 

For example, deeply embedded systems such as those in automotive, 
aerospace and medical domains have low dynamicity, and often their entire 
functionality and behaviour is known at design time, prior to deployment. The 
low dynamicity makes the performance of such systems easier to predict, and 
therefore guarantees regarding timeliness can be made (e.g., ECG signal of a 
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complete cardiac cycle will be processed in less than 10 ms). Such guarantees 
are often enforced by means of resource reservation and isolation, which 
can lead to very low levels of resource utilisation: a processing core can be 
exclusively allocated to a given job for the sake of performance predictability, 
but that job only needs the core to its full capacity for a limited period of its 
lifetime, leaving it subutilised for the rest of the time. 

On the other hand, HPC and cloud computing have high dynamicity due 
to the wide variety of workloads they have to handle. That makes it harder 
to make performance guarantees, because one never knows what comes next. 
And due to the cost of deploying and maintaining such platforms, they are 
often only viable if operated at saturation point, with nearly 100% utilisation, 
which undermines performance guarantees even further by making nearly 
impossible to rely on resource reservation or isolation. 

Figure 1.1 below shows both domains, embedded and HPC/cloud over the 
dimensions of dynamicity, typical resource utilisation and the ability to sustain 
performance guarantees. State-of-the-art resource allocation in the embedded 
domain is static, relying on the low dynamicity of those systems and producing 
allocations that can be derived at design time and used for the whole lifetime 
of the system, while ensuring the performance requirements are met even in 
worst case scenario. For HPC and cloud, the resource allocation is completely 
dynamic and often based on instantaneous metrics such as order of arrival of 
jobs and current utilisation of cores, which can certainly keep the platforms 
running at saturation point but cannot offer any performance guarantees. 

Recently, the dichotomy described above became less visible. Embedded 
systems are becoming increasingly complex, having to cope with dynamic 
workloads, and using less predictable platforms (i.e., multi-level caches, 
speculative execution), while still having to fulfil strict performance requisites. 
HPC and cloud computing, in turn, critically need to address fundamental 
problems in energy efficiency and performance predictability, as they become 
more widespread and critical to our daily lives. This points to the importance 
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Figure 1.1 Application domains and their characteristics with regard to dynamicity, resource 
utilisation and performance predictability. 


4 Introduction 


of the areas in the central part of Figure 1.1, which represents increasingly 
dynamic embedded systems and predictable HPC and cloud systems. 

The goal of this book is to identify and present resource allocation heuris- 
tics that can be used to achieve different levels of performance guarantees, and 
that can cope with different levels of dynamicity of the application workload. 


1.2 Related Work 


The problem of allocating tasks to platform elements is a classic problem in 
multiprocessor and distributed systems. Most formulations of this problem 
cannot be solved in polynomial time, and many of them are equivalent to 
well known NP problems such as graph isomorphism [18] and the generalised 
assignment problem [58]. 

This problem was first addressed from the cluster/grid point of view, but 
more recently the fine-grained allocation of tasks within manycore processors 
has also received significant attention due to its critical impact on performance 
and energy dissipation. In the following subsections, we consider allocation 
mechanisms at both grid and manycore CPU level, and review the most 
significant trends and achievements in terms of guaranteed performance and 
energy efficiency. 


1.2.1 Allocation Techniques for Guaranteed Performance 


There are numerous multiprocessor scheduling and allocation techniques that 
are able to meet real-time constraints, each of them under a different set 
of assumptions. A very comprehensive survey is given by [41], covering 
techniques that can be applied both at the grid or many-core level, but all 
of them assume that the platform is homogeneous and tasks are independent 
(i.e., do not explicitly consider communication costs). Many of them also 
assume that the allocation is done statically, or do not take into account 
the overheads of dynamically allocating and migrating tasks (1.e., context 
saving and transferring). In [96], heterogeneous platforms are considered but 
communication costs and overheads are still not taken into account. 

Significant research on resource reservation has been done, aiming to 
increase time-predictability of workflow execution over HPC platforms [90]. 
Many approaches use a priori workflow profiling and use estimation of 
task execution times and communication volumes to plan ahead which 
resources will be needed when tasks become ready to execute. Just like 
in static allocation, resource reservation policies significantly reduce the 
utilisation of HPC platforms. A reduction of 20-40% in the utilisation is not 
unusual [150]. 
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Allocation and scheduling heuristics based on feedback control have been 
used in HPC systems [44-83], aiming to improve platform utilisation without 
sacrificing performance constraints. Most cases concentrate on controlling 
the admission and allocation of tasks over the platform based on a closed- 
loop approach that monitors utilisation of the platform as well as performance 
metrics such as task response times [54]. 

Many cloud-based and grid-based HPC systems use allocation and 
scheduling heuristics that take into account not only the timing constraints 
of the tasks but also their value (economic or otherwise). This problem been 
well-studied under the model of Deadline and Budget Constraints (DBC) 
[27], where each task or taskflow has a fixed deadline and a fixed budget. 
State-of-the-art allocation and scheduling techniques target objectives such 
as maximising the number of tasks completed within deadline and/or budget 
[139], maximising profit for platform provider [76] or minimising cost to users 
[130] while still ensuring deadlines. Several approaches to the DBC problem 
use market-inspired techniques to balance the rewards between platform 
providers and users [154]. A comprehensive survey given in [157] reviews 
several market-based allocation techniques supporting homogeneous or hete- 
rogeneous platforms, some of them supporting applications with dependent 
tasks modeled as DAGs. 

At the many-core level, there are a few allocation techniques that take 
into account both the computation and communication performance guar- 
antees. Such techniques are tailored for specific platforms e.g., many-cores 
based on Network-on-Chip (NoC). To guarantee timeliness, all state-of- 
the-art approaches rely on a static allocation of tasks and communica- 
tion flows. In [6], a multi-criteria genetic algorithm is used to evolve 
task allocation templates over a NoC-based many-core aiming to reduce 
their average communication latency. The approach in [110] also used a 
genetic algorithm that could find an allocation that can meet hard real- 
time guarantees on end-to-end latency of sporadic tasks and communication 
flows over many-cores that use priority-preemptive arbitration. Stuijk [136] 
proposed a constructive heuristic to do static allocation of synchronous 
dataflow (SDF) application models [133], which constraint all tasks to 
read and write the same number of data tokens every time they execute. 
The allocation guarantees the timeliness of the application if the platform 
provides fixed-latency point-to-point connection between processing units. 
In [161], the same author relaxes some of the assumptions of SDF appli- 
cations (i.e., allows for changes on token production and consumption 
rates during runtime) and proposes analytical methods to evaluate worst- 
case throughput and to find upper bounds for buffering for a given static 
allocation. 
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1.2.2 Allocation Techniques for Energy-efficiency 


Most allocation techniques addressing energy efficiency operate at the many- 
core processor level, mainly because of the difficulties of dealing with energy- 
related metrics at larger system granularities. 

Hu et al. [60] and Marcon et al. [88] estimate the energy consumption 
according to the volume of data exchanged by different application tasks over 
the interconnection network. Such approaches lack in accuracy as they do not 
take into account runtime effects such as network congestion or time-varying 
workloads. Thus, research approaches on energy-aware dynamic allocation 
techniques have been proposed. 

In [129], an iterative hierarchical dynamic mapping approach aims to 
reduce energy consumption of the system while providing the required QoS. In 
such strategy, tasks are firstly grouped by assigning them to a system resource 
type (e.g., FPGA, DSP, ARM), according to performance constraints. Then, 
each task within a group is mapped, minimising the distance among them and 
reducing communication cost. Finally, the resulting mapping is checked, and 
if it does not meet the application requirements, a new iteration is required. 

Chou and Marculescu [37] introduce an incremental dynamic mapping 
process approach, where processors connected to the NoC have multiple 
voltage levels, while the network has its own voltage and frequency domain. 
A global manager (OS-controlled mechanism) is responsible for finding a 
contiguous area to map an application, and for defining the position of the 
tasks within this area, as well. According to the authors, the strategy avoids 
the fragmentation of the system and aims to minimize communication energy 
consumption, which is calculated according to Ye et al. [155]. This work 
was extended in [36, 38], incorporating the user behaviour information in the 
mapping process. The user behaviour corresponds to the application profile 
data, including the application periodicity in the system and data volume 
transferred among tasks. For real applications considering the user behaviour 
information, the approach achieved around 60% energy savings compared to 
a random allocation scenario. 

Holzenspies et al. [58] investigate a run-time spatial mapping technique 
with real-time requirements, considering streaming applications mapped onto 
heterogeneous MPSoCs. In the proposed work, the application remapping 
is determined according to information that is collected at design time 
(i.e., latency/throughput), aiming to satisfy the QoS requirements, as well 
as to optimize the resources usage and to minimise the energy consumption. 
A similar approach is proposed in Schranzhofer et al. [120], merging pre- 
computed template mappings (defined at design time) and online decisions 
that define newly arriving tasks to the processors at run-time. Compared to 
the static-mapping approaches, obtained results reveal that it is possible to 
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achieve an average reduction on power dissipation of 40-45%, while keeping 
the introduced overhead to store the template mappings as low as 1 KB. 

Another energy-aware approach is presented in Wilderman et al. [151]. 
This approach employs a heuristic that includes a Neighborhood metric 
inspired by rules from Cellular Automata, which allows decreasing the 
communication overhead and, consequently, the energy consumption imposed 
by dynamic applications. Lu et al. [85] propose a dynamic mapping algorithm, 
called Rotating Mapping Algorithm (RMA), which aims to reduce the overall 
traffic congestion (take in account the buffer space) and communication energy 
consumption of applications (reduction of transmission hops between tasks). 

In turn, Mandelli et al. [87] propose a power-aware task mapping heuristic, 
which is validated using a NoC-based MPSoC described at a cycle-accurate 
level. The mapping heuristic is performed in a given processor of the system 
that executes a preemptive operating system. Due to the use of a low level 
description, accurate performance evaluation of several heuristics (execution 
time, latency, energy consumption) is supported. However, the scope of the 
work is limited to small systems configurations due to the long simulation 
time. In the previous works, only one task is assigned to each processing 
core. A multi-task dynamic mapping approach was proposed in [128]. Singh 
et al. [128] extends the work described in [32], which evaluates the power 
dissipation as the product of number of bits to be transferred and distance 
between source-destination pair. 

Research in energy-efficient allocation for HPC and cloud systems is still 
incipient, with existing works addressing only the time and space fragmen- 
tation of resource utilisation at a very large granularity (server level), aiming 
to minimise energy by rearranging the load and freeing servers that are then 
turned off [12, 101]. 


1.3 Challenges 


While the approaches mentioned in the previous section have presented 
sophisticated resource allocation approaches that can provide performance 
guarantees and/or improve energy efficiency, there are still challenges that 
require more advanced resource allocation approaches. The following sub- 
sections briefly describe some of those challenges, which are precisely the 
ones addressed in this book. 


1.3.1 Load Representation 


Load models are internal representations used by allocation algorithms to 
evaluate different allocation alternatives. Such models may use informa- 
tion that is available a priori about the load (such as job dependencies, 
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communication volumes, worst case execution times), but can be also 
extended with information obtained during runtime (e.g., actual execution and 
communication times). In dynamic resource allocation, it is very challenging 
to define a load model that includes sufficient information about static and 
dynamic characteristics of the load, and that is lightweight enough to be used 
by allocation heuristics to quickly evaluate and compare alternative allocation 
possibilities during runtime. 

Chapter 2 addresses this challenge and presents a load model based on 
an interval algebra, aiming to allow quickly compose the load of multiple 
computation and communication jobs (represented as series of time intervals), 
enabling the evaluation of the impact of resource allocation (and thus resource 
sharing) on system performance and timeliness. 


1.3.2 Monitoring and Feedback 


In large-scale systems, obtaining updated information about the load during 
runtime is not trivial. Often, such information only makes sense when coupled 
with information about the underlying computation and communication plat- 
form. Furthermore, the costs of monitoring and transferring all such data to 
the resource allocation mechanism is already prohibitive. The major challenge 
in such scenarios is then to define a sufficiently meaningful set of metrics 
to monitor, and to design algorithms that can make meaningful resource 
allocation decisions based on the changes on those metrics over time. 

Feedback control algorithms have been used for decades to make decisions 
based on time-series data, so in Chapters 3 and 4 we describe possible uses of 
such closed-loop algorithms to support resource allocation. In Chapter 3, we 
show that they can be used to increase throughput and energy-efficiency in 
HPC and cloud workloads. In Chapter 4, on the other hand, we show that it can 
be used to efficiently perform admission control tasks, aiming to maximise 
system utilisation without jeopardising predictability in performance-sensitive 
HPC applications. 


1.3.3 Allocation of Modal Applications 


Allocation heuristics may have to guarantee hard real-time constraints to 
critical jobs. This is possible for applications that have been profiled a priori so 
their execution and communication patterns can be accurately represented by 
an accurate load model. Such applications will not be highly dynamic, and will 
exhibit modal behaviour, so that distinct modes of operation can be analysed at 
design time, so the dynamic allocation can be based on pre-defined alternatives 
(thus the number of allocation decisions during runtime is minimal). 

To address such scenario, modal allocation heuristics can guarantee hard 
real-time constraints by allowing different different allocations for each 
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operation mode while minimising the amount of remappings during mode 
transitions. Chapter 5 describes search-based heuristics that identify alloca- 
tions that are optimised for specific operation modes, but also for coping 
with dynamic mode changes. It uses automotive applications and Network- 
on-Chip platforms as case studies, and shows that it is possible to guarantee 
hard real-time constraints during each of the system’s modes as well as during 
transitions. 


1.3.4 Distributed Allocation 


In closed-loop systems, a centralised resource manager continuously receive 
feedback from the system so that it can have an up-to-date representation of its 
state. This usually comes with a significant communication overhead, specially 
in large-scale systems. Fully distributed approaches, on the other hand, offer 
higher scalability by relying on decision-making done by individual system 
components using only locally-available information. However, due to the lack 
of global knowledge, it is harder to achieve a reasonable level of performance 
predictability. 

Chapter 6 presents a bioinspired approach based on the notion of swarm 
intelligence, aiming to support a fully distributed approach to load remapping. 
It can be used on its own or in conjunction with centralised approaches, 
aiming to fine-tune allocation decisions based on up-to-date local data. A 
case study based on multi-stream video processing over Network-on-Chip 
platforms shows the strengths and weaknesses of such approach. 


1.3.5 Value-based Allocation 


Many of the quality metrics associated to resource allocation in HPC and 
cloud computing are platform-specific. For instance, metrics that are often 
used to formulate optimisation objectives (such as job execution times, 
communication volumes and throughput) are not comparable across different 
computational platforms. There are other metrics, however, that are com- 
pletely independent of the computational platform and relate instead to the 
requirements of the end-user. One of such metrics is the value of the completion 
of a job. This can be seen as a simple value, perhaps associated to a particular 
currency. More commonly, such value will be a function of time: the result of 
a job is very likely to lose value over time, and can even become worthless if 
it takes too long to be obtained. 

Chapter 7 addresses resource allocation heuristics that are designed to 
optimise such time-varying notion of value. It presents approaches that can 
be configured to rely more or less on load models obtained in advance, and 
shows how much can be gained in value if these models are available. 
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Load and Resource Models 


The efficient allocation of computational resources requires some under- 
standing of the resources themselves and their availability, as well as the 
load that must be allocated to them. Possibly under different names, the 
concept of resource and load modelling is commonly used in embedded, 
high-performance and cloud computing. For example, workflow models in 
HPC and task graphs in embedded computing are common ways to represent 
application load, while platform and resource models are used to represent the 
processing, networking and storage capabilities of the computer systems that 
run those applications. 

With the help of meaningful resource and load models, it is possible 
to evaluate the impact of different resource allocation techniques on the 
efficiency of resource usage and on application performance requirements. 
The more accurate the models, the better they can predict the performance 
of a computer system under a given load. On the other hand, dynamic and 
complex systems are harder to model accurately, so there is clearly a trade-off 
here. 

In real-time embedded computing, for example, it is common practice to 
constrain the execution of software to sporadic and bounded time intervals, 
and to disable advanced features of microprocessors such as out-of-order 
execution and caching, aiming to simplify the system’s behaviour and enable 
the creation of accurate load and resource models. At that level of accuracy, 
system designers can use such models to evaluate different resource allocation 
alternatives and identify the ones under which the system will never violate 
any of its performance guarantees, not even in a worst-case scenario. 

Such practice, however, requires a complete knowledge of the system 
resources as well as the load to be allocated to them. In many embedded 
systems, and in the large majority of high-performance and cloud computing 
systems, that is not the case. Therefore, recent modeling approaches have 
ways to represent load and resources under different levels of uncertainty. 
Stochastic models of the arrival and execution times of application-specific 
load or of the availability of computational resources, for example, are now 
commonly used to characterise average-case system performance. 


11 


12 Load and Resource Models 


2.1 Related Work 


In real-time systems, load models are often variations of the sporadic task 
model [42] or the time-triggered model [71], focusing explicitly on timing and 
on the repetitive nature of tasks (e.g., data from a sensor must be processed 
every 2 ms; a new gene sequencing job will be launched at least every 
millisecond) rather than the functional dependencies between them. 

Dataflow application models are usually untimed, and different tasks are 
synchronised by the data flowing through the system. Dataflows are usually 
modelled through graphs that represent the functional dependencies between 
tasks and some information about the nature of the data transfer. Many 
different dataflow models exist [134], with different types of constraints on 
the execution of tasks and communication aiming to allow different types 
of analysis (e.g., statically schedulable, time predictable, bounded commu- 
nication buffering). Many HPC workflows are also modelled as dependency 
graphs, often as directed acyclic graphs (DAGs) [144]. Such graphs can be 
annotated with estimations of execution time and communication volumes, 
which can be used to optimise resource allocation or implement resource 
reservation mechanisms [90]. Similarly, estimations of inter-arrival times, 
execution times and communication volumes can be modelled stochastically, 
allowing for a more general understanding of the characteristics of a given 
workload [51]. That approach can also be used to support the generation 
of synthetic application models that follow the specific characteristics of a 
realistic scenario. 

Advanced workflow management systems augment HPC and scientific 
computing workflow models with execution semantics [86], allowing such 
workflows to be analysed in similar ways as in time-triggered and dataflow 
models mentioned above. Finer granularity models are also used in HPC 
[10, 111], where application load is represented as a series of computation 
and communication bursts (often obtained from execution traces), but such 
models are too complex to be analysed and therefore are used only to drive 
abstract simulation. 

A number of application modelling approaches try to capture character- 
istics that are critical to specific domains. Within the automotive domain, 
a component-based software specification standard is established, called 
AUTOSAR (more in www.autosar.org). Within this standard, software com- 
ponents covering runnable entities can be defined by specifying interfaces, 
execution rates and timing constraints. However, AUTOSAR takes a conser- 
vative stand and does not allow the dynamic allocation of runnable entities to 
different computational units. 

Advanced approaches in application modelling supports the creation 
of hybrid models, i.e., models created using different underlying rules. 
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Ptolemy [47] is a modelling and simulation framework supporting hybrid 
application modelling for embedded systems using actor-orientation (a flexible 
model for representing concurrent behaviour). It supports different types 
of time-triggered and dataflow modelling approaches, among others, and is 
amenable to extensions to specific domains. 


2.2 Requirements 


Within the scope of this book, we use a model of load and resources that can 
cater to both worst-case and average-case system performance. It supports 
complete and accurate description of a system’s load and resources, but is 
also able to accommodate different levels of uncertainty by allowing stochastic 
descriptions of load. 

We therefore define a load model using the notion of jobs, which should 
represent the different parts of an application and, more specifically, the load 
each of those parts imposes on platform resources. This is a general-purpose 
model, aiming to have constructs that are flexible enough to represent multiple 
types of application components. For example, a job could represent the 
execution of a software task over a CPU, the transmission of a stream of 
data over a network, or the dynamic reconfiguration of an FPGA device. 

In order to model different types of applications, from embedded to HPC 
systems, such load models must be powerful enough to cover characteristics, 
e.g., functional properties, that are commonly found in such systems, as well 
as non-functional properties that can be used to evaluate the impact of different 
allocation mechanisms. In the subsections below, we present the requirements 
for such load modelling approach along four distinct categories: structure, 
temporal behaviour, resource constraints and load characterisation. 


2.2.1 Requirements on Modelling Load Structure 


Load models should be able to support multiple levels of abstraction, exposing 
more or less details of the application architecture according to the level 
of accuracy that is needed when evaluating the impact of a particular 
resource allocation mechanism. For instance, it may be useful to assume 
that all application jobs are completely independent, abstracting away their 
inter-communication, if the overheads due to data exchange are negligible. 
Therefore, the application structure denotes how an application can be broken 
in multiple jobs and how these jobs relate to each other. 

Regarding the application structure, we list requirements for a load 
modelling approach, so that the model is powerful enough to represent the 
most common types of applications. 


14 Load and Resource Models 


2.2.1.1 Singleton 
Ability to model applications that are composed of a single job. 


2.2.1.2 Independent jobs 

Ability to model applications that are composed of an arbitrary number of jobs 
that do not depend on or communicate with other jobs. It is assumed that jobs 
constantly have access to all information they need. 


2.2.1.3 Single-dependency jobs 

Ability to model applications that are composed of an arbitrary number of 
jobs that can depend on one and only one other job. Therefore, the application 
model must explicitly have the notion of dependencies between jobs. 


2.2.1.4 Communicating jobs 

Ability to model applications that are composed of an arbitrary number of job 
pairs. Intuitively, each pair includes a computing job and a communication job, 
but the strict definition of a communicating job should be a pair of dependent 
jobs that cannot be allocated to the same resource type (see requirements 
on resourcing in Subsection 3.3). This enforces the notion that, in this kind 
of application, communication can only be performed once the respective 
computation has completed. 


2.2.1.5 Multi-dependency jobs 

Ability to model applications that are composed of an arbitrary number 
of computation jobs, each of them depending on an arbitrary number of 
communication jobs, and also initiating an arbitrary number of communication 
jobs. The structure of this type of model constrains the application in such a 
way that the communication jobs initiated by a given computation job must 
not depend on computation jobs that depend directly or indirectly on their 
initiator (no cyclic dependencies). 


2.2.2 Requirements on Modelling Load Temporal Behaviour 


The temporal behaviour of the load defines the release of application jobs, 
i.e., when a job can actually be executed over a resource. Behaviours can be 
generally classified in time-driven (requirements 2.2.2.1, 2.2.2.2 and 2.2.2.3 
below) or event-driven (remaining requirements). 

Regarding application temporal behaviour, we list the following require- 
ments for a load modelling approach, so that it is powerful enough to represent 
the following types of application jobs. 
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2.2.2.1 Single appearance 
Ability to model an application job that is not part of a series, and is released 
at a specific point in time. 


2.2.2.2 Strictly periodic 

Ability to model an application job that is part of a series of jobs with release 
times separated by a constant time interval. If the release time of a job and 
its order within the series is known, the release time of all other jobs can be 
derived from it. 


2.2.2.3 Sporadic 

An application job that is part of a series of jobs with release times separated 
by a time interval that has a known lower bound. For every release of a job, it 
is therefore known that the release of the subsequent job of the series will not 
happen before that lower bound. 


2.2.2.4 Aperiodic 

An application job that can be released at any arbitrary time. It can be used 
to model event-driven systems where no assumptions can be made about the 
event sources. If an assumption can be made about the minimum time interval 
between successive events, such job series can be conservatively (but perhaps 
not accurately) modelled as sporadic jobs. 


2.2.2.5 Fully dependent 
An application job that is released immediately after the completion of all jobs 
that it depends on. 


2.2.2.6 N out of M dependent 
An application job that is released immediately after the completion of any N 
jobs out of all M jobs that it depends on, (M > N). 


2.2.3 Requirements on Modelling Load Resourcing Constraints 


The resourcing of applications defines which kind of resources a given job 
requires for its execution. This requires a taxonomy of resources over different 
types. The load model addressed here makes no assumption about such taxon- 
omy, and it may work under different typing systems (e.g., flat type hierarchy, 
single-parent type hierarchy, multiple-inheritance type systems), as different 
resource allocation mechanisms might benefit from them. Regarding this clas- 
sification, we list the following requirements for a load modelling approach, 
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so that it is powerful enough to represent the following types of application 
jobs. 


2.2.3.1 Untyped job 
Ability to model an application job that can be executed on any type of 
resource. 


2.2.3.2 Single-typed job 
An application job that must be executed over a specific type of resource. 


2.2.3.3 Multi-typed job 
An application job that can be executed over multiple types of resource. 


2.2.4 Requirements on Modelling Load Characterisation 


The characterisation of the application load defines how long each of its jobs 
uses the resources they were allocated. Regarding this classification, we list the 
following requirements for a load modelling approach, so that it is powerful 
enough to represent the following types of application jobs. 


2.2.4.1 Fixed load 

An application job that always occupies a resource for a constant amount of 
time, regardless of the resource. The load of such a job can be characterised 
by a scalar. 


2.2.4.2 Probabilistic load 

An application job that occupies a resource for a probabilistic amount of time, 
regardless of the resource. The load of such a job is a random variable, and 
can be characterised by a histogram or a probability density function. 


2.2.4.3 Typed fixed load 

A multi-typed application job that occupies resources of different types by a 
potentially different, yet constant amount of time. The load of such a job can 
be characterised by a vector of scalars, and the length of the vector is equal to 
the number of types of resources that the job can occupy. 


2.2.4.4 Typed probabilistic load 

A multi-typed application job that occupies resources of different types with 
a potentially distinct stochastic behaviour on each of them. The load of such 
a job can be characterised by a vector of probability density functions or 
histograms, and the length of the vector is equal to the number of types of 
resources that the job can occupy. 
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2.3 An Interval Algebra for Load and Resource Modelling 


Within this book, we will rely on a novel approach to load and resource 
modelling based on an interval algebra (IA). It will be used throughout the 
book to ease our understanding of the impact of different resource allocation 
mechanisms. But more importantly, it can be used by the resource allocation 
mechanisms themselves as an internal representation of the resources and the 
load that they are supposed to manage. 

Our IA represents non-functional characteristics of application load using 
the mathematical concept of intervals. It can be used to analytically derive 
the impact of using different resource allocation policies on the original 
application characteristics. The main concerns of this book are performance 
and time predictability, so most of our examples focus on the representation 
of time intervals, but the interval algebra can naturally be extended to support 
other non-functional properties such as energy dissipation. 

In a simplistic example, we can consider an application with three jobs 
A, B and C, and a homogeneous platform composed of two processors with 
first-come-first-serve scheduling. Each of the jobs can be represented by an 
interval that denotes the time they need to run: A = [0,30[, B = [0,45], 
C = [0,20] (assuming in this example that they are all independent and ready 
to run at time = 0). By using simple interval algebra operations, a resource 
allocation heuristic can estimate the response time R of the three tasks under 
different allocation schemes (e.g., Ra = 30, Rg = 45 and Rc = 50 if A and 
C are allocated, in that order, to one of the processors and B is allocated to 
the other), and thus can dynamically decide whether it is likely to meet the 
applications constraints when using a given allocation. 

While trivial, such example can be made arbitrarily complex by allowing 
different resource scheduling disciplines, a larger number of tasks and pro- 
cessors. For the interval algebra, however, the analysis of the response times 
under a specific allocation would still involve the application of the same 
interval manipulation rules. 

The advantages of such an approach are numerous, including the 
following. 


e It enables dynamic allocation mechanisms to have an appropriate level 
of confidence on whether the chosen allocation meets the applications’ 
timing constraints. 

e The approach can be used as a fitness function of search-based allocation 
heuristics, if the algebraic operations are sufficiently lightweight as they 
have to be applied over a potentially large search space (some examples 
of integrating IA to genetic algorithms are provided in Chapter 5). 

e The solution of algebraic operations can be found in multiple ways, with 
different levels of performance. Therefore, resource allocation heuristics 
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can be improved simply by optimising the solution of the employed 
algebraic operations. 

e If absolute predictability is not required (i.e., in soft real-time and best- 
effort applications), algebraic operations can be solved faster by applying 
approximations that sacrifice the accuracy of the final result. This enables 
allocation mechanisms that can be applied to systems with different levels 
of strictness of their timing requirements. 


Let us now introduce the main principles behind this interval algebra. Our 
goal in this book is not to be overly formal, so we will favour intuitive 
descriptions over mathematical formalism whenever possible (1.e., without 
sacrificing precision). In general terms, an algebra is a definition of symbols 
and the rules for manipulating those symbols. Our interval algebra, therefore, 
establishes rules for the manipulation of intervals. It defines different types of 
intervals, which represent the amount of time a particular piece of application 
load requires from a notional resource. For example, a single job can be 
represented by a time interval using the notation below: 


#A#0#40 (2.1) 
where the first element of the tuple is a unique job identifier, the second is a 
non-negative real number representing the release time of the job and the third 
is a positive real number representing the job’s load, i.e., the actual length of 
the time interval. In the example above, job A is released at time 0 and requires 
40 time units of a resource. The same concept can also be represented using the 
mathematical notation for a left-closed right-open bounded interval [0, 40]. 
Following the definition above, our LA must also define rules for manipula- 
tions of such intervals: what happens when an interval is allocated to a specific 
type of resource, what if two intervals are allocated to the same resource, etc. 
Widely used algebras define a small number of basic operations (e.g., addition, 
multiplication) and then define more complex operations as composites of 
those basic operations (e.g., matrix multiplication). Our IA defines two basic 
algebraic operations: time displacement and partition. Time displacement 
changes the endpoints of an interval by an arbitrary value t, and denotes that 
the job has to wait for its allocated resource (i.e., its starting and ending times 
were moved ¢ time units to the future). Partition simply breaks one interval in 
two, and denotes that a job was preempted from a resource (and the second 
interval produced by the partition is likely to be time-displaced). All other 
interval-algebraic operations of IA, which can represent an arbitrarily large set 
of allocation and scheduling mechanisms, can be expressed as compositions 
of these two. By applying these operations, it is possible to investigate the 
impact of different resource allocation and scheduling mechanisms on the 
endpoints of the intervals, which in turn denote the completion times of each 
application component. 
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In the following subsections, we show how our IA addresses the 
requirements described in Section 2.2. 


2.3.1 Modelling Load Structure 


The interval-based representation of a job presented above is sufficient to 
express a singleton. By using a set of such intervals, independent jobs can be 
also represented. To denote a dependency between two tasks A and B, the 
notation can be extended to include a job identifier instead of the release time 
of a job: 

# BH A##50 (2.2) 


This notation is capable of denoting single dependency jobs, and conveys 
that interval B’s start-point depends on interval A. Multiple dependencies can 
also be specified as a dependency set, and thus multi-dependency jobs can be 
covered: 


#C#{ A, B}4#260 (2.3) 


This notation assumes that whenever an interval has dependencies, its start- 
point lies exactly at the highest endpoint among all the intervals it depends 
on. In this example, assuming that jobs A and B are defined as in examples 
(2.1) and (2.2), this leads to: A = [0, 40), B = [40, 90), C = [90, 350). 


2.3.2 Modelling Load Temporal Behaviour 


The intervals described in the previous subsection are single-appearance and 
have a fixed release time, therefore express singleton jobs. A strictly periodic 
series of jobs can be characterised by its release time, the period after which 
a new job is released, and the time interval each job requires from a notional 
resource. We denote such job series with the notation exemplified below, which 
is exactly the same as the notation of a singleton task followed by the period: 


#: D#0#40#100 (2.4) 


Mathematically, it represents an infinite series of intervals, such as: D = 
[0, 40), [100, 140), [200, 240),.... This extension is expressive enough to 
represent strictly periodic tasks. 

The release time of sporadic tasks is not deterministic but has well defined 
bounds. In case of aperiodic tasks, those bounds do not exist. To model those 
cases, IA represents release times with aleatory variables. Those variables are 
associated with probability distributions that can constrain assumed values. 
We will cover that approach in Section 2.3.5 when we discuss intervals with 
stochastic representations of time. 
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2.3.3 Modelling Load Resourcing Constraints 


IA represents a resource as the dimension over which jobs are operated upon. 
Jobs, each represented by its respective interval, are allocated onto a resource; 
algebraic operations determine how the resource is shared between all of them, 
and how the resource sharing affects their timings. We denote a resource with 
the notation exemplified below: 


+Z: (#A#0#40) (2.5) 


where the algebraic operation +Z; is applied to the set of intervals surrounded 
by brackets (only A in the example above). The example below shows the same 
resource, but this time with two distinct jobs mapped to it: 

+ Z(#A#0#40, #B#0#50) 

+ Z1(#A&40, #B&90) = (2.6) 

+ Z1([0, 90)) 


In this example, we introduce two different ways to evaluate the operator + Z1 
(which we can intuitively understand as a resource serving jobs under a FIFO 
schedule). The first evaluation of the operator preserves the identities of the 
mapped jobs, and it indicates the completion times of each one of them after the 
symbol “82”. We will refer to this type of evaluation as information-preserving 
(or simply preserving). The second way to evaluate the operator is equivalent 
to the first, but it does not preserve any information about the individual 
operands. It simply determines the busy period(s) of the resource with one or 
more intervals. We refer to this type of evaluation as information-collapsing 
(or simply collapsing). 

Comparing with elementary algebra, the two evaluations of the operator 
are akin to solving an expression like (3 + 5) + (1 + 2) using an intermediate 
step 8 + 3 before arriving to the final result 11. In both algebras, there is an 
infinite set of possible operands that could lead to a particular result, and there 
is no information in the final result that could allow the backtracking of the 
initial operands. 

A slightly different example is shown below, using the same jobs but this 
time mapped onto resource + Z2 that uses a time-division multiplexing (TDM) 
scheduler with a quantum of 8 time units: 


+ Zo(#A#0#40, #B#0#50) 
+ Zo(++A8272, #B&90) = (2.7) 
+ Za([0,90)) 


It is worth noticing that only the intermediate expression (i.e., after the pre- 
serving evaluation) differs, and the final result after the collapsing evaluation 
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is the same. This is always the case if the operand denotes a work-preserving 
scheduler (i.e., a resource is never idle if there are jobs ready to be served). 

The two following examples show jobs allocated to a resource that is 
shared under a priority-preemptive scheduler, assigning priorities in the same 
order the jobs are passed to the operator (higher to lower): 


+ Za (#C#15#40, 74 DH 10750, #E#0#50) 
+ Z3(#C&55, #D&100, #E&140) = (2.8) 
+ Z3([0, 140)) 


+ Za HHS, HG HOPS, HH #2645, 7; 14524758) 
+ Za(AFE14, #G&22, #H&31, 41837) = (2.9) 
+ Z4((0, 22), (24, 37)) 


In both cases, the algebraic operations abstracts away the specific interleaving 
patterns of the execution of every job. Each of the evaluation types focusses 
solely on, respectively, the finish times of each job or the idleness of the 
resource. For example, formula (2.9) represents the following: job G starts 
to be executed at time zero, but after 10 time units it is preempted by job F 
which runs to completion for 10 time units; then G resumes and runs for its 
remaining execution time until time equals 22 units; resource Z4 becomes idle 
until job / is released at 24 time units, which in turn suffers a preemption from 
H between times 26 and 31 units and then executes until time equals 37 units 


Just like single appearance jobs, periodic jobs can be allocated to resources: 
+ Z;(A#0#40#-100, #B #0450) 
+ 21(#A&40, #B&90, #A#100#40# 100) (2.10) 
+ Zu ([0, 90), #A#100#40#100) 


It is important to notice that a periodic job series always remains as a distinct 
interval in the result of both preserving and collapsing evaluations of an 
operator. This reflects the infinite nature of the series. 

One of the crucial properties of a job is its affinity, which means that it can 
be served only by the designated resources. The job that can be executed on 
any resource available in a system is referred to as untyped job. If a job can be 
executed on a single type of resources only, it is a single-typed job. A multi- 
typed job can be executed on a few (enumerated) resource types, possibly 
with different execution times on each of them. In all examples so far, only 
untyped jobs have been used. To describe a single-typed or multi-typed job, 
the notation should support the definition of different types of resources and 
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different types of resource affinity. This can be expressed as follows, where 
each scalar in pointy brackets denotes a different type and the absence of type 
constraints implies untyped jobs or resources (as in examples above): 


+ Z5 (2) (J (2) 4403515, #K (2, 3, 8) #0720, #LAO#14) (2.11) 


By allowing the definition of resources types and resource requirements, it is 
also possible to present communicating jobs by modelling the job as two fully 
dependent intervals with distinct resource requirements, one for computation 
and one for communication (i.e., the job can only communicate over resource 2 
once it has finished being computed over resource 1): 


HL(1) #04614 


#M (2) 4 L 44340 na 


2.3.4 Modelling Load Characterisation 


The representation of load as the interval length, denoted by a positive real 
number (as defined in Subsection 2.3.1), is already capable of representing a 
fixed load. 

To represent a typed fixed load, we allow the specification of different 
interval lengths for different resource types using a similar notation as the one 
introduced at the end of Subsection 2.3.3: 


#N (2, 4,6) 404 (10, 20, 20) (2.13) 


To represent a probabilistic load or typed probabilistic load, we have to rely 
on aleatory variables to represent the load. This can be done for both typed 
and untyped jobs. 


2.3.5 Stochastic Time 


In many cases, it may be desirable to represent intervals with non-deterministic 
temporal behaviour or load characterisation. In these cases, IA allows the use of 
aleatory variables, which follow a probability distribution, instead of scalars. 
It does not impose any limitation on the choice of probability distributions, 
and their parameters should be provided following a well established notation. 
For example, a normal distribution N (u, o?) with parameters mean pp = 2 
and variance 0? = 1, N(2, 1) can be used to denote the release time of job 
P, and similarly N (40, 1) can denote its execution time: 


#P#normal(2, 1)4#normal(40, 1) (2.14) 


The time when job P finishes its execution is described by the convolution of 
two Gaussians: N (2,1) x V’(40, 1). 
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Figure 2.1 Example of a probability mass function of a discrete random variable describing 
a job's execution time. 


It is particularly convenient to represent time as a discrete random variable 
described by a probability mass function (PMP), i.e., a function giving the 
probability that a discrete random variable is equal to a provided value. 
All values of a PMF should be non-negative and sum up to 1. Using this 
function for describing a job’s execution time, the best-case execution time 
(BCET) and the worst-case execution time (WCET) correspond to the first 
and the last probability value of the distribution, respectively. Between these 
two extremes, the probabilities of the remaining possible execution times are 
described. For example, the PMF of a job execution time whose BCET and 
WCET equals to 10 ms and 13 ms is shown in Figure 2.1. This job can be 
described using IA as: 


#Q#0#4pmf (10, 0.4), (11, 0.2), (12, 0.3), (13, 0.1) (2.15) 


To show how tasks with stochastic timing can be mapped to a resource, we 
use job T' whose release time is described by discrete uniform distribution 
U{0, 1} and is executed in 40 time units, and job U depending on T executed 
inU/{1, 4} time units by a notional resource +Z; with FIFO scheduling. Then: 


+ Z1(#T#pmf (0, 0.5), (1, 0.5) #40, 
#U#T#pmf (1, 0.25), (2, 0.25), (3, 0.25), (4, 0.25)) = 
+ Zi([pmf(0,0.5), (1, 0.5), pm f (41, 0.125), (42, 0.25), 
(43, 0.25), (44, 0.25), (45, 0.125))) 


(2.16) 


2.4 Summary 


This chapter presented the requirements for workload and platform models 
that are suitable to support resource allocation mechanisms in embedded, 
high-performance and cloud computing. Such models can be used as internal 
representations, allowing resource allocation mechanisms to evaluate different 
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allocation alternatives. A specific modelling approach has been introduced, 
based on an interval algebra, which fulfils the listed requirements and is 
amenable to compact and efficient implementations. A reference implemen- 
tation of the presented algebra is available from the DreamCloud project 
website!. 


“http://www.dreamcloud-project.org 
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Feedback-Based Admission Control 
Heuristics 


Applying feedback mechanisms to monitor the capacity of computing 
resources and quality-of-service (QoS) levels can guarantee a bounded time 
response, stability, bounded overshoot even if the exact knowledge of a system 
workload and service capacity is not available a priori [2]. Thus, in case of 
a careful fine-tuning of parameters, they can be successfully applied even 
to systems with real-time constraints (see the Related Work section). It was 
verified that this approach helps to find a trade-off between multiple objectives 
of a workflow management system, e.g., minimal slacks and maximum core 
utilisation [53]. 

The feedback-control dynamic resource allocation heuristics impose some 
requirements on the target system, which should guarantee that the appropriate 
input data is available and that the generated output can be used to perform the 
proper resource allocation. Usually to perform a resource allocation decision 
we can rely on various metrics, provided by the monitoring infrastructure 
tools and services, such as utilization and the time latency between input 
and output timestamps [81]. The system should also guarantee an appropriate 
level of responsiveness to the decisions made by the heuristics, as well as 
update the values of the metrics used as inputs in the algorithm frequently 
enough for the particular application. The platform should support scheduling 
on distributed-memory infrastructure resources. It is important to provide the 
heuristic algorithm with realistic data about system workload, service capacity, 
worst-case execution time and average end-to-end response time [84]. 

The task mapping process presented in this chapter is comprised of 
the resource allocation and task scheduling. The technique proposed in this 
chapter assumes the presence of a common task queue, which is used by the 
global dispatcher. The resource allocation process is executed on a particular 
processing unit. Its role is to send the processes to be executed to other 
processing units, putting them into the task queue of a particular core. 
The process dispatching, i.e., selecting the actual process to run, is also a 
part of the scheduling algorithm and is carried out locally on each core. It is 
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assumed that task scheduling is performed in a non-preemptive early deadline 
first (EDF) or first-in-first-out (FIFO) based manner. 

Later in this chapter we propose an algorithm to map firm real-time tasks 
into multi-core systems dynamically, using dynamic voltage and frequency 
scaling (DVES) to decrease energy dissipation in cores. According to simula- 
tion results, the proposed method leads to more than 55% of dynamic energy 
reduction. 


3.1 System Model and Problem Formulation 
3.1.1 Platform Model 


The controlling process of dynamic behaviour of a target system can be 
performed in two ways: feed-forward and feed-back, presented in Figure 3.1. 
Although the closed-loop scheme includes larger number of functional blocks 
and requires measuring output values, it requires less accurate model of the 
target system and is also more resistant to disturbances [7]. A closed-loop 
system is characterised with a feedback loop, which carries values of measured 
output (y(t), aka controller value). These values are subtracted from their 
desired value (r, reference signal, setpoint). The result of this operation forms 
error (e(t)) signal, which is used to compute control input (u(t)). This value 
is sent back to the target system. 

A proportional-integral-derivative (PID) controller is particularly often 
used in various industrial control system, recently including computing 
systems [57]. 

The PID controller in the time-domain form is described in the following 
way: 


u(t) = kpe(t) + nf e(7)d7 + kač elt). (3.1) 


r u(t) y(t) 
—, Controller Target system 
e(t) u(t) y(t) 
o 


Figure 3.1 Block diagrams of control system architectures: feed-forward (above) and 
feedback (below). 
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The determination of proportional (kp), integral (k;) and derivative (ką) 
constant components of PID controller is known as PID controller tuning. 

The PID controller is often presented in an equivalent form in the frequency 
domain, where function (3.1) of time t is presented as a function of complex 
frequency s using the Laplace transform, leading to 


k; 
K(8) = kp + = + kas. (3.2) 


A PID controller is often described using other constant parameters: k — so 
called proportional gain, T; — integral time constant and Tọ — derivative time 
constant 


1 
K(s)=k (1 + T; + sT,) (3.3) 


Since increasing the value of parameter kg enhances noise, the derivative 
component is often omitted in numerous practical applications [57]. It is also 
not used in the work described in this chapter despite its positive influence on 
stability or speed. 

In Figure 3.8, a general view of the proposed architecture is presented. 


3.1.2 Application Model 


We consider a workflow of a particular structure. There is no dependencies 
between tasks and the deadline of each task computation is set as a sum of its 
computation time multiplied by an arbitrary value and arrival time. There is 
only one priority of task; tasks cannot be preempted during their execution. 
During simulation we measure cluster core utilisation, which is the percentage 
of cores in the clusters executing tasks in particular simulation time t. 


3.2 Distributed Feedback Control Real-Time Allocation 


After releasing task t;, the role of the dispatcher is to decide which of the 
clusters C}, j = 1,...,m, is to execute the task. This decision can be made 
using various metrics, we decide to apply a choice of the cluster whose cores 
are currently the most idle. If more than one core satisfies the chosen condition, 
one of them is chosen randomly. For the comparison purpose we also allowed 
the dispatcher to choose the target cluster Cj in the round-robin manner. The 
task t; is then placed in the j-th queue. 

Each j-th cluster includes one admission control block, AC;. Its role is to 
decide whether a task t;, read from the j-th input queue, should be executed 
by the cluster. The first condition of admittance is that the deadline of t;, D;, is 
not lower than the sum of its computation time, C; and the current simulation 
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time t. Then the input value from the controller, u;(t), is tested. If this value 
is positive, the task is admitted, otherwise it is rejected. Admitted tasks are 
placed in the internal cluster queue. This queue is planned to be rather short 
to minimise the delay between decision about admittance and the execution 
of the task, and to keep the timeliness of the lateness input. 

To control the admittance in each cluster, we use discrete-time controllers 
in two variants. The first of them is a PI (i.e., a PID controller without the 
derivative component) whose controlled value is an average lateness of a 
(parameterisable) number of previous tasks computed by the cluster cores, 
where lateness is defined as the difference between a task response time and 
its deadline. If a lateness is negative, the task has been finished before its 
deadline, and positive otherwise. The current value of lateness is compared 
with the setpoint, r, and an error e;(t) is computed. It is provided as an input to 
a controller, which computes admittance allowance value u,(t). The second 
variant includes a P controller (i.e., a PID controller with the proportional 
component only) whose output value u; (t) depends on the difference between 
the current core utilisation and the setpoint. The output value of u; (t) is sent to 
AC, where it is used to perform a task admittance decision. In both situations, 
as long as value of control input u;(t) is positive, the task is allowed to be 
submitted to cores, otherwise it is rejected. The admitted tasks are placed in 
the queue. 

An idle core Core; k, j =1,...,m,k =1,...,n, fetches a task t; from 
the j-th core queue and then executes it in a non-preemptive manner. After 
execution, the lateness of the i-th task, L; = D; — t, is computed. Each core 
also informs the observer whether it is occupied or idle. 

The role of observer Monitor; is to compute two metrics based on the 
performance of all cores in the 7-th cluster. The first metric is core utilisation 
and the second metric is an average lateness of the previous q tasks computed 
by the cores in the 7-th cluster. These data are provided to the j-th controller 
and the dispatcher. 


3.3 Experimental Results 
3.3.1 Controller Tuning 


In order to check the efficiency of the proposed feedback-based admission 
control and real-time task allocation, we developed a simulation model using 
SystemC language. We firstly configured it to operate in the open-loop manner. 

In Figure 3.3 we present the maximum task lateness in the open-loop 
system consisted of three clusters, each including three cores. In every 
situation, at 5,000 ns a number of tasks, ranging from 5 to 500, each requiring 
execution time equal to 50,000 ns, has been generated. Then we looked at the 
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Figure 3.3 Maximum normalised task lateness (with execution time equal to 50,000 ns) in 
step responses for a number of tasks (3 clusters, 3 cores in each). 


maximal task lateness, where each lateness has been normalized by dividing 
it with the deadline. 

In order to tune the controller, we analysed the step-input maximum 
normalised task lateness response in the open-loop system. As an input we have 
used a burst release of 500 tasks (with execution time equal to 50,000 ns) at 
5,000 ns. The system was comprised of 3 clusters, each including 3 computing 
cores. The obtained result confirms the accumulating (or integrating) nature 
of the process, which can be described by the following model [64]: 


F(s) = e (3.4) 
where 7 is the dead time, i.e., the delay between changing input and the 
observable output reaction, and V is the velocity gain, which is the slope of 
the asymptote of the process output. 

In such kind of processes, to choose proper values of PI controller 
components, AMIGO (Approximate M-constrained Integral Gain Optimisa- 
tion) tuning formulas can be applied [7]. According to these formulas, the 
parameters k and T; are equal: 


0.35 
te = 
6.5) 
Ti = 13.357. (3.6) 


Both these parameters can be determined using the step output illustrated in 
Figure 3.4: k = 0.2741 and T; = 1108. 
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Figure 3.4 Maximum normalised task lateness step response for 500 tasks (with execution 
time equal to 50,000 ns) released at 5,000 ns (3 clusters, 3 cores in each). 


The usage of the core utilisation in a cluster as a controlled value is a bit 
more tricky due to its non-linearity. Because of the obvious saturation at 100 
per cent (see Figure 3.5) in the case of step response, to compute parameters 
of a controller we limit the considered operating region to the proportional 
range before the saturation, which ranges from 1 to 4 ns. Its maximum slope 
tangent can be described by linear formula y = 0.33 x — 0.33. According to 
classic Ziegler-Nichols method [64], the k parameter of the P controller can 
be computed as 
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Figure 3.5 Cores utilisation step response for 500 tasks (with execution time equal to 50,000 
ns) released at 0 ns (1 cluster with 3 cores). 
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where A is the absolute value of the y-coordinate of the intersection of the 
max slope tangent with the OX axis. In our case A = 0.33 and, consequently, 
k = 3. 


3.3.2 Stress Tests 


The workload used in our introductory experiment consists of 900 independent 
tasks, one released every 5,000 ns, whose computation time equal to 50,000 
ns and deadline is set to the sum of computation time multiplied by 1.2 and the 
task release time. In Table 3.1, the number of rejected tasks, tasks executed 
before and after their deadlines in various controlling environment settings is 
presented. 

The two first rows present the result obtained in the open-loop systems. 
The choice of the external queue has only slight influence on the number of 
tasks executed before their deadlines. In these situations task can be rejected 
by the admission control only if the task slack (computed as D; — C; — t) is 
negative. 

Applying a closed-loop approach improves the system performance sig- 
nificantly, but the proper choice of the measured output is also essential. In 
case of the core utilisation not a single task finishes after its deadline, as the 
tasks are submitted to the queue only when there is at least one idle core, 
that can start executing the task instantly. The lateness is less correlated with 
the real temporal availability of computational power, so as many as 116 tasks 


Table 3.1 Number of rejected tasks, tasks executed before and after their deadlines in various 
controlling environment configurations for a periodic task workload simulation scenario 
(3 clusters, 3 cores in each): configuration parameters (above) and obtained results (below) 


Config. No. | Architecture Queue Controller Value Controller Allocation 


1 Open-loop FIFO - - min core util. 
2 Open-loop EDF - - min core util. 
3 Closed-loop Both core utilisation P min. CPU util. 
4 Closed-loop Both lateness PI min. CPU util. 
5 Closed-loop Both core utilisation P RR 

6 Closed-loop Both lateness PI RR 


Config. No. | Tasks before Deadline Tasks after Deadline Tasks Rejected 


1 149 661 90 
2 154 655 91 
3 738 0 162 
4 614 116 170 
5 675 0 225 
6 607 127 166 


3.3 Experimental Results 33 


have been sent to the queue despite not a single core was capable of computing 
the task before its deadline. 

To asses the improvement of the core-utilisation-based allocation, we 
performed simulations where the tasks are allocated to clusters in a round- 
robin manner. In both P- and PI-based architectures we obtained worse results 
by 8.5 and 1.14 per cent, respectively. Importantly, the higher improvement 
has been observed in the architecture leading to the overall better results. 

The clusters’ cores utilisation for this architecture during the first 500 ns 
is presented in Figure 3.6. Except for the initialisation (and finalisation, not 
shown in the figure) there is no time when any of the clusters has less then 66 
per cent of the core utilisation. After computing the average core utilisation 
during the whole simulation time we get 90.12%, 90.20%, and 90.16% for 
the first, second and the third core, respectively. The tasks have been sent 
by the dispatcher to the first cluster 296 times and to the 2nd and the 3rd 
core respectively 292 and 313 times, which can be viewed as a quite even 
distribution. 

The control signal (generated by a controller, sent to the admission control) 
for the first 500 ns of the simulation is shown in Figure 3.7. The positive value 
of this signal means that at least one core from the given cluster has finished 
the previous task computation and is being idle. In this situation the next task 
should be submitted to the cluster as soon as possible. 

Similar results have been observed in other workloads of a periodic nature 
with uniform (or nearly uniform) execution time. 
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Figure 3.6 Core utilisation measured during the first 500 ns of the simulation. 
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Figure 3.7 Control signal observed during the first 500 ns of the simulation. 


3.3.3 Random Workloads 


In our next experiment, summarised in Table 3.2, we analysed 30 randomly 
bursty workloads, generated according to the method described in [22], 
including from 827 to 962 tasks of diverse execution time, ranging from 
1 to 2,67,582 ns. Three target system configurations have been checked: 
open-loop with EDF and closed-loop with CPU utilisation as the controller 
value, where the allocation is performed using the minimal core utilisation 
metric and in the round-robin way. Once again, the closed-loop approach 
leads to better results but, in comparison with the periodic-task scenario in the 


Table 3.2 Total number of rejected tasks, tasks executed before and after their deadlines in 
various controlling environment configurations for 30 random bursty task workload simulation 
scenarios (3 clusters, 3 cores in each): configuration parameters (above) and obtained results 
(below) 


Config. No. | Architecture Queue Controller Value Controller Allocation 
1 Open-loop EDF - — min. core util. 
2 Closed-loop Both core utilisation P min. core util. 
3 Closed-loop Both core utilisation P RR 


Config. No. | Tasks before Deadline Tasks after Deadline Tasks Rejected 


1 10603 1752 14059 
2 12296 753 13365 
3 11946 675 13793 
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experiment described above, the improvement, equal to about 16%, is slightly 
less impressing. Similarly, the difference between allocating task under the 
minimal core utilisation criteria and round-robin is rather slight and equals 3 
per cent. It is worth stressing, however, that the tasks in the analysed workloads 
are characterised with very diverse time of computations, but despite this 
variance they are not differentiated by our model. Consequently, one task can 
occupy a core for longer time, not allowing other (submitted a bit later) tasks 
to be executed on this core because of the lack of preemption. 


3.4 Dynamic Voltage Frequency Scaling 


Dynamic Voltage Frequency Scaling (DVFS) is a power saving technique, 
omnipresent in CMOS circuits, benefiting from the fact that their dynamic (or 
switching) power P is proportional to the the square of core supply voltage V, 
and its clock frequency f, i.e., Pax fV?. Since any reduction of core voltage 
requires an adequate decrease of the clock frequency, some trade-off between 
energy savings and computation performance is achieved. Some guidance in 
real-time systems stems from the fact that there is usually no additional benefits 
from faster task execution as long as it is before the deadline. Moreover, for 
typical workloads the required peak computational performance is usually 
much higher than the average [106]. Thus sustaining a lower voltage/frequency 
for most of time and increasing it only when required by a workload growth, 
in a way it risks missing some deadlines, seems to be a sensible strategy. To 
perform a proper voltage scaling decision, it is possible to rely on various 
metrics, provided by the monitoring infrastructure tools and services, such 
as utilization and time latency between input and output timestamps [81]. 
In multiprocessor domain, the cores can operate on different voltage at a 
given instant, so allocating a task to the most suitable core starts to be a more 
sophisticated task even in case of homogeneous cores, since assigning a task 
to a core with lower voltage can lead to missing the deadline that would be 
met in case of a different decision. The term voltage scheduling has been 
introduced to refer to scheduling policies using DVFS facility to improve 
energy efficiency. 

In Multiprocessor Systems on Chips (MPSoCs) a task can be mapped 
to a core either statically or dynamically, just before its execution, which 
is particularly beneficial in case of workloads not known a priori [123]. In 
DVFS-based systems, the problem of dynamic task mapping is even more 
difficult, since not only resource utilisation and application structure have 
to be analysed, but also the present voltage level of each processor needs 
to be considered. Modern operating systems, including both Windows and 
Linux (2.6 Kernels and above) support dynamic frequency scaling for systems 
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with Intel (SpeedStep technology) and AMD (PowerNow! or Cool‘n’ Quiet 
technology) processors. Frequency levels in these chips are not continuously 
available, but a limited number of discrete voltage/frequency levels is offered. 
They follow the Advanced Configuration and Power Interface (ACPI) open 
standard, defining such processor states as CO (operating state), C1 (halt), C2 
(stop-clock), and C3 (sleep). In C states with higher numbers less energy is 
consumed, but returning to the normal operating state imposes more latency. 
In some device families additional C-states have been introduced, such as 
C6 in Intel Xeon when an idle core is power gated and its leakage is almost 
entirely reduced to zero [52]. While core is in the CO state, it operates with 
one of several power-performance states, known as P-States. In PO, a core 
works with the highest frequency and voltage level, and subsequent P-States 
offer less performance but also require less energy. The most recent ACPI 
specification can be found at Unified Extensible Firmware Interface Forum!. 

In operating systems, frequency scaling depends on an applied governor. In 
case of Linux, the ondemand governor switches frequency to the highest value 
instantly in case of high load, whereas the conservative governor increases 
frequency gradually [77]. These policies aim to keep processor utilization 
close to 90%, progressively decreasing or increasing frequency using heuris- 
tics [102]. This approach may, however, negatively impact applications with 
timing constraints. To overcome this limitation, a custom governor can be 
developed and applied. These governors can then operate on per-core and 
per-chip basis, taking into account utilisation of other machines in a cluster, 
etc. A valuable comparison between per-core and per-chip DVFS is presented 
in [68], where per-core DVFS is shown to offer even more than 20% energy 
savings in comparison with the conventional chip-wide DVFS with off-chip 
regulators. However, per-core DVFS is rarely implemented and, for example, 
all active cores in contemporary Intel i7 processors must operate with the 
same frequency in the steady state, whereas AMD processors allows their 
cores work with different frequencies, but one voltage value, appropriate to 
the core with the highest frequency, is to be provided to all the cores [52]. 

In the remainng part of this chapter, we propose a custom governor 
algorithm for per-chip DVFS. The algorithm performs dynamic resource 
allocation and assumes the presence of a common task queue, which is used by 
a global dispatcher. The resource allocation process is executed on a particular 
processing unit, whose role is to send the processes to be executed to other 
processing units, putting them into the task queue of that core. The task 
dispatching, i.e., selecting the actual task to run, is also a part of the mapping 
algorithm and is carried out locally on each processor. It is assumed that task 
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scheduling is performed in a non-preemptive first-in-first-out (FIFO) based 
manner for simplicity, but another scheduler can be used instead. 


3.5 Applying Controllers to Steer DVFS 


In Figure 3.8, a general view of the proposed architecture is presented, where 
dashed lines are used for steering P-States. We consider workflows of a 
particular structure. All tasks are assumed to be firm real-time, so certain 
number of missing deadlines is allowed, but the task executed after its deadline 
is invaluable to the user. There are no dependencies between tasks and all 
tasks have equal priorities. Further, tasks cannot be preempted during their 
execution. 

After releasing task T;, the role of the dispatcher is to decide which of the 
processors Processor;,j = 1, . . . , m, is to execute the task. This decision can 
be made using various metrics. We measure processor core utilisation, which 
is the percentage of busy cores in the processors executing tasks in particular 
simulation time t, and choose the processor whose cores are currently the least 
utilized. If more than one processor have the same lowest utilisation, one of 
them is chosen randomly. The task T; is then placed in the j-th external queue. 

To control the admittance in the j-th processor, we use a discrete-time 
PI controller (i.e., a discrete-time PID controller without the derivative 
component) whose output value u;(t) depends on the difference between 
the current core utilisation and the setpoint. The output value of u;(t) is sent 
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Figure 3.8 Distributed feedback control real-time allocation with DVFS architecture. 
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to admission control block AC}, where it is used to perform a task admittance 
decision. 

The role of block AC; is to decide whether a task T;, fetched from the j-th 
external input queue, should be executed by the processor. The first condition 
of admittance is that the deadline of T;, D;, is not lower than the sum of its 
worst-case computation time, C; and the current simulation time t. Then the 
output controller value, u;(t), is checked and it influences the decision of the 
task rejection or admission as described in the next paragraph. The admitted 
tasks are placed in the internal processor queue. This queue shall be rather short 
to minimise the delay between decision about admittance and the execution 
of the task, and to keep the timeliness of the lateness input. 

The additional role of block AC; is to scale the voltage of the cores. The 
controller output value, u;(t), is tested against two threshold values +Y and 
—Y. If u;(t) > +Y, the processing cores are more utilised than the setpoint r 
for relatively long period (depending on the I- Window length and k; value) and 
thus increasing the frequency (and voltage) of the set of cores is desirable. On 
the other hand, if u;(t) < —Y, the processing cores are too idle for relatively 
long period and it is recommended to decrease the frequency (and voltage) of 
the cores to conserve energy. It is important to select the value of Y wisely, 
taking into account that u;(t) depends on the current error value (multiplied by 
kp) and on the sum of the previous errors (multiplied by k;) and the length of 
I-Window used during this sum calculation. After choosing these three values, 
it is possible to assign an appropriate value to this threshold. Identification of 
these values and the threshold is performed in Section 3.3. 

Since in any core transferring between various voltage levels is penalised 
both in terms of switching time and energy [52], some mechanism preventing 
too frequent transitions is needed. In our case, we decided to use threshold 
T, which determines the minimal time between two consecutive voltage level 
alterations. Each P-State change request issued earlier than I is ignored. This 
value should be determined by taking into account the hardware parameters 
as a trade-off between the system flexibility (lower parameter value) and 
efficiency (higher parameter value), which is presented in Section 3.3. 

The proposed admission control algorithm is composed in two parts, 
described respectively by lines 1-28 and 29-36 in Figure 3.9, which are 
executed concurrently. The first part consists of the following steps. 

Step 1. Invocation and initialization (lines 1-3, 27): The block functional- 
ity is executed in an infinite loop (line 1), activated every time interval At (line 
27). The current P-State is set to the lowest value (i.e., the highest performance 
— line 2), and the time of the previous P-State change, y, is set to O (line 3). 

Step 2. Task fetching and schedulability analysis (lines 4-5): The tasks 
input FIFO queue is checked if empty (line 4) and a task T; is fetched 
(line 5). 
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Inputs: Task T; (from Task queue) 
Controller output value u 
Admission controller invocation periods At 
Outputs: Task executing or rejection decision 
New P-State of Cores 
Constants: Pmax - maximal P-State available in processor 
Y - threshold value of cumulated error from controller 
T - minimal time elapsed between P-State change 
Variables: P - current P-State 
7 - time of the previous P-State change 


1: while (true) do 
2, P=0 
3 y=0 
4 while (task queue is not empty) do 
5: Fetch T; 
6: if (P = Oand u < 0) 
or (u < 0 and current_time < y +T) then 
7 if P > 0 and current time > y +T then 
8: P=P-1 
9: y = current_time 
10: Clear Window in Controller 
11: end if 
12: Reject task T; 
13: else 
14: ifu<—YandP>0O 
and current_time > y +T then 
15: P=P-1 
16: = current_time 
17: Clear Window in Controller 
18: else if u > +Y and P < Pmax 
and current_time > y +T then 
19: P=P+1 
20: y = current_time 
21: Clear I-Window in Controller 
22: end if 
23: Admit task T; 
24: Send T; to FIFO 
25: end if 
26: end while 
27: Wait At 


28: end while 


29: while (true) do 
30: if (task queue is empty for T and P < Paz) then 


31: P=P+1 

32: Clear I-Window in Controller 
33: y = current_time 

34: end if 

35: Wait At 


36: end while 


Figure 3.9 Pseudo-code of the proposed admission controller functionality. 


Step 3. Task conditional rejection (lines 6-12): If the output value of 
controller (u) is negative and the cores operate with the highest performance 
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(P-State set to PO) or the cores operate below the highest performance and the 
previous change of P-State (at time y) was done early enough (determined 
by condition current_time > y +T), task T; should be rejected (line 6). 
Moreover, if P-State is different from PO and there was no recent change 
of P-State (line 7), P-State is decreased (line 8) and variable keeping the 
previous P-State change time, y, is set accordingly (line 9). The buffer storing 
the previous error values of the controller, I-Window, used by the integral 
component of the PID controller, is cleared (line 10), since the previous errors 
have been obtained in a different P-State and thus should not influence future 
admittance decisions. 

Step 4. Task conditional admittance (lines 14-25): If the controller output 
value is below threshold —Y, the processor performance is not the highest 
possible and the previous change of P-State was done early enough (line 14), 
P-State is decreased (line 15) and the current_time is substituted to y (line 
16). Similarly, provided the controller output value is above threshold +Y, 
the processor performance is not the lowest possible (P-State is different from 
Pmaz, the highest P-State available in the processor) and the previous change 
of P-State was done early enough (line 18), the processor P-State is increased 
(line 19) and the current time is assigned to y (line 20). Task T; is sent to the 
FIFO queue (line 24). 

The second part of the algorithm consists of the following steps. 

Step 1. Invocation (lines 29, 35): The block functionality is executed in an 
infinite loop (line 29), activated every time interval At (line 35). 

Step 2. P-State conditional increase (lines 30-33): If no new tasks have 
been fetched from the task queue for time I’ and the processor performance is 
not the lowest possible (P-State is different from Pax), the processor P-State 
is increased (line 31) and the current time is assigned to y (line 33). 


3.6 Experimental Results 


In order to check the efficiency of the proposed feedback-based admission 
control and real-time task allocation, we developed a simulation model using 
SystemC language. 


3.6.1 Controller Tuning 


Firstly, the controller constant components kp, ki and kq have to be tuned by 
analysing the corresponding open-loop system response to a bursty workload. 
Then random workloads of various weight have been tested to observe the 
system behaviour under different conditions and to find the most beneficial 
operating region. 
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To tune the parameters of the controller, the task slack growth after 
applying a step-input in the open-loop system (i.e., without any feedback) has 
been analysed, as mentioned earlier in this chapter. This is a typical way in 
control-theory-based approaches [7]. The workload used for this case consists 
of 500 independent tasks. They are split into five groups. Tasks belonging 
to one group are released every 5 ms each. After this bursty activity, during 
the following 500 ms no task is released. Then the tasks of the next group 
are released at the same pace. This process is repeated until all tasks from all 
groups are released. The computation time of each task is equal to 50 ms and 
its deadline is set to the sum of computation time multiplied by an arbitrary 
constant (equal to 1.5) and the task release time. This constant has been 
introduced to provide some flexibility in task scheduling; otherwise, all tasks 
would be required to start execution at their release time to meet the timing 
constraints. 

The obtained results have confirmed the accumulating (integrating) nature 
of the process, and thus the accumulating process version of Approximate M- 
constraint Integral Gain Optimization (AMIGO) tuning formulas have been 
applied to choose the proper values of the PID controller components [7]. 
As a reference point, we executed a simulation without DVFS on a system 
comprised of one processor with three cores. As many as 140 tasks have been 
executed before the deadline, no task missed its deadline, and 360 tasks have 
been rejected by the dispatcher. 

To use the DVFS features efficiently, it is crucial to find an appropriate 
value of threshold Y. It should be large enough not to switch a core voltage 
too frequently — the switching should be performed not only due to the high 
value of u(t) generated by the proportional component, but also with the 
relatively large value of the integral component, meaning that the error has 
been large for a longer interval. Simulations results for selected values of T 
are presented in Table 3.3. Form this table it follows that too high values 
of Y result in keeping the current frequency too long at the beginning of a 
busy period, decreasing the performance of the system significantly (see the 
number of tasks executed before the deadline for Y € (30, 60)). Particularly, 
keeping the lowest frequency too long results in executing some tasks after 
their deadlines. At some point (Y = 70 in the considered case), the threshold 
is too high for the given idle period and it does not manage to perform any 
voltage scaling before the next busy period. For further experiments, based 
on the above observations, we have chosen Y = 10 as a trade-off between 
performance and flexibility of the voltage switching. 

In order to determine the threshold T° of the task number that has to be 
processed by the dispatcher between subsequent alterations of the core voltage, 
we performed a series of simulation, where the threshold ranged from 25 ms 
to 400 ms. The highest number of tasks executed before their deadlines is 
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Table 3.3 Total number of tasks executed before and after their deadlines and rejected tasks 
with various Y threshold in the introductory experiment (1 processor with 3 cores) 


Threshold Tasks Tasks Tasks 
Y before Deadline after Deadline Rejected 
5 120 0 380 
10 120 0 380 
20 120 0 380 
30 64 84 352 
40 52 92 356 
50 52 92 356 
60 52 92 356 
70 140 0 360 
00 140 0 360 


observed with threshold T € {25 ms, 50 ms, 100 ms}. The threshold set to 
350 ms or above leads to the behaviour not differentiated from the simulation 
without DVFS. Two values [ = 50 ms and [ = 300 ms have been used in 
the further experiments. 

To estimate the energy used by a processor, ACPI data for Pentium M 
processor (with Intel SpeedStep Technology) has been used, but with slight 
modification the proposed technique can be applied to any processor with 
ACPI implemented’. In Pentium M, there are six levels of allowed frequency 
and voltage pairs, known as P-States. In P-State PO, a core works with 1.6 GHz 
and 1.484 V, whereas for P5 — 600 MHz and 0.936 V, which uses 24.5 W and 
6 W, respectively. 


3.6.2 Random Workloads 


Having selected all the required constants, the efficiency of the system has 
been checked against 11 sets of 10 random workloads, whose release and 
execution time probability distributions are based on the grid workload of an 
engineering design department of a large aircraft manufacturer, as described 
in [22]. Each workload is comprised of 100 tasks, including a random number 
(between 1 and 20) of independent jobs. The execution time of every job is 
selected randomly between 1 and 99 ms. All jobs of a task are released at 
the same instant, and the release time of the next task is selected randomly 
between r; + range_min - Ci and r; + range max - Ci, where Ci; is the 
total worst case computation time of the current tasks T; released at r;, and 
range_min, range_maz € (0,1), range-min < range_maz. These values 
are inversely proportional to the workload weight. 


“For example AMD Family 16h Series Processors ACPI parameters are provided in AMD 
Family 16h Models 00h — OFh Processor Power and Thermal Data Sheet, AMD, 2013. 
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We have measured the numbers of tasks computed before their deadlines 
and the number of tasks rejected by the admission controller block in a 
3-processor system with 3 processing cores and 2-processor system with 4 
processing cores each for systems with DVFS and without DVFS (i.e., with 
T = 00) and presented it in Figures 3.10 and 3.11, respectively. 

The number of tasks admitted with I' = 300 ms is, in total, 26% higher 
than with = 50 ms. The reason for this is that in case of [ = 50 ms, P-States 
are changed more often and thus it is more likely to have a processor with a 
lower frequency and voltage level while a task is fetched, and since decrease 
of P-States is performed gradually (lines 8, 15, 19 and 31 in the algorithm 
in Figure 3.9), tasks are attempted to be executed with lower processor 
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Figure 3.10 Tasks executed before their deadline in random workload scenarios for DVFS 


with [ = 50 ms (red), T = 300 ms (green) and I’ = oo (blue) — three processors with three 
cores (top) and two processors with four cores (bottom) systems. 
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Figure 3.11 Tasks rejected in random workload scenarios for DVFS with T = 50 ms (red), 
T = 300 ms (green) and = oo (blue) — three processors with three cores (top) and two 
processors with four cores (bottom) systems. 


performance. It has been observed that this strategy leads to significant (about 
39%) energy reduction. It may be, however, surprising, that the number of 
the executed tasks is higher with DVFS when I’ = 300 ms than in the system 
without DVES for lighter workloads. This phenomena is innate to the proposed 
technique and can be explained using the pseudo-code in Figure 3.9. In a 
system without DVFS, each processor is always in its lowest P-State, PO. The 
admission controller has then no flexibility in decreasing the P-State while 
the Controller output is negative (checked in line 6) and then to clean the 
I-Window in the PID controller and, finally, admit the task (line 21). 
Different size of the systems does not influence the relationship between 
obtained results. The number of tasks executed before deadlines for assorted 
weights is almost linearly dependent between the two considered architectures 


3.6 Experimental Results 45 


(Pearson Correlation Coefficient p = 0.96; similarly for the number of rejected 
tasks p = 0.97, and for dissipated energy p = 0.98). 

The dynamic energy dissipation, normalised with respect to the highest 
obtained value during the experiment, is presented in Figure 3.12. In general, 
almost 58% of the dissipated dynamic energy have been saved via the DVFS 
approach. For heavier loads from the random workloads, choosing a lower T 
value leads to significant energy reduction, whereas for lighter loads the result 
difference between the two chosen I values is almost negligible. 

Looking at the normalized energy dissipation per task (Figure 3.13), 
computed for the system with three processors, it can be concluded that 
parameter [ = 50 ms leads to more even energy per task usage in comparison 
with I’ = 300 ms, which is slightly more beneficial for lighter workloads only, 
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Figure3.12 Normalized dynamic energy dissipation in random workload scenarios for DVFS 
with I’ = 50 ms (red), T = 300 ms (green) and I’ = oo (blue) — three processors with three 
cores (top) and two processors with four cores (bottom) systems. 
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Figure 3.13 Normalized dynamic energy dissipation per task meeting its deadline in random 
workload scenarios for DVFS with IT = 50 ms (red), T = 300 ms (green) and I = oo (blue). 


but leads to similar energy per task value than in the system without DVFS 
(i.e., = 00) for heavier loads. 


3.7 Related Work 


Dynamic real-time scheduling (RTS) algorithms like EDF and rate monotonic 
support RTS characteristics (worst-case computation time, release rate, etc.), 
but they remain open-loop: once the schedule is built, it stays fixed [80]. Such 
algorithms perform well in meeting QoS levels in predictable systems. How- 
ever, their performance degrade when dealing with unpredictable workloads 
which cannot be modelled accurately a priori [2]. In unpredictable systems, 
sophisticated schedulers like Spring depends on worst-case parameters which 
can result to resource under-utilisation based on pessimistic estimation of 
workloads [135]. 

It is desired to feed the system states back to the scheduler [26, 121], 
so it can be aware of sudden/unpredicted changes and act accordingly in 
order to meet the QoS levels. A system state can be defined as the system 
performance with respect to service capacity, QoS levels etc. Real-time 
systems analysis includes observing how tasks’ RTS characteristics affect (1) 
meeting QoS levels e.g., high processor utilisation, and (2) compute resource 
availability. This can help adapting towards the varying system states by 
enforcing scheduling decisions [26]. This variance can be considered as system 
error which is the deviance from the system output and the desired one (QoS 
level(s)). Some related work like [26, 78] show the lack of adaptivity to varying 
system states in traditional RTS algorithms. 
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In the realm of control-theoretic RTS algorithms, there are adaptive 
approaches that are cost-effective for performance guarantees in systems with 
varying system states [26, 80]. For instance, Lu offers regulating the workload 
of a single-CPU RTS system via a PID-based admission control (PID-AC) 
algorithm to reduce the deadline miss ratio [80]. His algorithm guaranteed 
95% CPU utilisation and 1% deadline miss ratio in comparison to an EDF 
algorithm with 100% and 52% respectively. PID-AC algorithms are plausible 
in some RTS systems where controlling tasks release rate is difficult, instead, 
the scheduler rejects specific tasks to meet QoS levels. 

Also, in [81], Lu uses two PID controllers to meet two QoS levels; 
maximum CPU utilisation and minimum deadline miss ratio. The results 
confirm the findings of [80] in ensuring performance guarantees as opposed 
to basic EDF algorithm. The motivation behind Lu’s 2-PID algorithm was to 
address the stability and transient response issues with the single PID algorithm 
due to PID control limitations handling multiple QoS levels. Our work, in this 
chapter, addresses minimising dependent tasks’ latency, not deadline miss 
ratio, via a PID-based admission control algorithm. 

Control-theory-based voltage level selection of unicore portable devices 
has been firstly proposed in [106]. Varma et al. in [147] choose Proportional- 
Integral-Derivative (PID) controllers to determine voltage of systems dealing 
with the workload not accurately known in advance and interpreted the mean- 
ing of the discrete PID equation terms with regards to dynamic workloads. 
They proposed a heuristic to predict the future system load based on the 
workload rate of change, leading to significant energy reduction. They also 
demonstrated that the design space is not particularly sensitive to changes in 
the PID controller parameters. However, the controller is used to predict the 
future workload and does not use any feedback information from the system 
about the processing core status. 

In [162], a feedback-based approach to manage dynamic slack time for 
conserving energy in real-time systems has been proposed. A PID controller 
is used to predict a real execution time of a task, usually lower than its worst 
case execution time (WCET). Then each execution time slot for a task is split 
into two parts and the first part is executed with a lower voltage assuming the 
execution time predicted by the controller. If the task does not finish by this 
time, a core is switched to its highest voltage guaranteing that the task finishes 
its execution before its deadline. 

In [153] a formal analytic approach for DVFS dedicated to multiple clock 
domain processors benefits from the fact that the frequency and voltage in 
each functional block or domain can be chosen independently. A multiple 
clock domain processor is modelled as a queue-domain network where queue 
occupancies linking two clock domains are treated as feedback signals. 
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A clock domain frequency is adapted to any workload changes by a 
proportional-integral (PI) controller. 

The queue occupancy also drives PI controllers in [31]. In contrast to 
previous research, the limitation of using single-input queues only have been 
lessened and multiple processing stages have been allowed, but a pipelined 
configuration is still required. A realistic cycle-accurate, energy-aware, mul- 
tiprocessor virtual platform is used for demonstrating the superiority of 
feedback techniques over the local DVFS policies during simulation of signal 
processing streaming applications with soft real-time guarantees. It is assumed 
that as long as the queues are not empty, a sufficient number of deadlines is 
met and no further analysis or simulation of deadline misses are provided. 

From the literature survey it follows that there is no previous work on 
mapping the task dynamically to an MPSoC system and using DVFS together 
with control-theory based algorithm. 


3.8 Summary 


In this chapter, we have explored the possibility of applying feedback 
controlled values to dynamic task allocation and admission control for high- 
performance computing clusters executing real-time tasks. Two real-time 
metrics, task lateness and core utilisation, have been applied to perform 
admission control, whereas the former has been also used as a metric for 
dynamic task allocation. Two queue types (EDF and FIFO) have been used in 
the open-loop system. The P and PI controllers have been tuned using classic 
AMIGO and Ziegler-Nichols methods. 

We have prepared simulation models in SystemC language and performed 
a number of experiments. In case of uniform periodic workload the closed- 
loop system has executed almost 5 times more tasks before their deadline 
in comparison with an adequate open-loop system. The queue type and 
controller value have slightly influenced the outcome. The metric-based 
dynamic allocation, in the best configuration, has been about 8.5 per cent 
better than the round-robin method. 

In the case of bursty random workloads with large computation time 
variance, a closed-loop-based system has been about 16% better than the 
corresponding open-loop approach. However, to proper asses the difference 
between these systems in more accurate way, the differentiate between tasks 
of long and short execution time should be introduced. 

We have also explored the possibility of applying feedback control values 
to dynamic task allocation and admission control for multi-core processors 
supporting DVFS while executing firm real-time tasks. Core utilisation has 
been applied as a run-time metric to perform admission control and dynamic 


3.8 Summary 49 


task allocation. The proposed governor algorithm has been tested with various 
parameter values and some guidance for tuning has been provided. 

Even in case of relatively difficult bursty scenarios a significant power 
reduction has been obtained in exchange for executing lower number of tasks 
before their deadlines. It is a role of a system designer to choose proper 
parameter values to obtain a satisfiable trade-off between energy consumption 
and performance. The proposed approach leads to similar results in two 
considered systems of different sizes, thus it may be viewed as quite robust to 
different system configurations. 

The minimal interval allowed between two consecutive switchings of P- 
States (threshold 1”) influences workloads of various weights in different, but 
predictable way. An adaptive choice of I’ value can be then viewed as a simple 
yet effective improvement of the proposed technique. 


Taylor & Francis 
Taylor & Francis Group 


http://taylorandfrancis.com 


4 


Feedback-Based Allocation and 
Optimisation Heuristics 


The vast majority of existing research into hard real-time scheduling on many- 
core systems assumes workloads to be known in advance, so that traditional 
scheduling analysis can be applied to check statically whether a particular 
taskset is schedulable on a given platform [42]. The hard real-time scheduling 
is desired in several time critical systems such as atomotive and aerospace 
domains [56]. Under dynamic workloads, admitting and executing all hard 
real-time (HRT) tasks belonging to a taskset can jeopardise system timeliness. 
The decision of task admittance is made by admission control. Its role is to 
fetch a task from the task queue and check whether it can be executed by any 
core before its deadline and without forcing existing tasks to miss theirs. If the 
answer is positive, the task is admitted, and rejected otherwise. The benefits 
of this early task rejection are twofold: (i) the resource working time is not 
wasted with a task that will probably violate its deadline, and (ii) a possibility 
of early signalling the lack of admittance can be employed to perform an 
appropriate precaution measures in order to minimize the negative impact of 
the task rejection. 

Dynamic workloads do not necessarily follow the relatively simple peri- 
odic or sporadic task models and it is rather difficult to find a many-core 
system scheduling analysis that relies on more sophisticated models [42], 
[67]. Computationally-intensive workloads not following these basic models 
are more often analysed in High Performance Computing (HPC) domain, for 
example in [35]. The HPC community experience with these tasksets could 
help introducing novel workload models to many-core system schedulability 
analysis [42]. In HPC systems, tasks allocation and scheduling heuristics 
based on feedback control proved to be valuable for dynamic workloads [82], 
improving platform utilisation while maintaining timing constraints. Despite 
a number of successful implementations in HPC community, these heuristics 
are to the best of our knowledge never used in many-core embedded platforms 
with hard real-time constraints. 
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The Roadmap on Control of Real-Time Computing Systems [5], one of the 
results of the EU/IST FP6 Network of Excellence ARTIST2 program, states 
clearly that feedback scheduling is not suitable for applications with hard 
real-time constraints, since feedback acts on errors. However, further research 
[140, 162] show that although the number of deadline misses must not be used 
as an observed value (since any positive error value would violate the hard 
real-time constraints), observing other system’s parameters, such as dynamic 
slack, created when tasks are executed earlier than their worst-case execution 
time (WCET), or core utilisation, could help in allocating and scheduling tasks 
in a real-time system. 

The feedback-based dynamic resource allocation heuristics impose some 
requirements on the target system. Usually, to perform resource allocation 
decision one can rely on various metrics provided by the monitoring infrastruc- 
ture tools and services, such as utilization and the time latency between input 
and output timestamps [81]. The system should also guarantee an appropriate 
level of responsiveness to the decisions made by the heuristics, as well as 
update the values of the metrics used as inputs in the algorithm. Moreover, it 
is important to provide the heuristic algorithm with realistic data about system 
workload, service capacity, worst-case execution time and average end-to-end 
response [84]. 

In order to address the aforementioned issues, we present a novel task 
resource allocation process, which is comprised of the resource allocation and 
task scheduling. The resource allocation process is executed on a particular 
core. Its role is to send the processes to be executed to other processing cores, 
putting them into the task queue of a particular core. Task scheduling is carried 
out locally on each core and selects the actual process to run on the core. 
The proposed approach adopts control-theory based techniques to perform 
runtime admission control and load balancing to cope with dynamic workloads 
with hard real-time constraints. It is worth stressing that, to the best of our 
knowledge, no control theory based allocation and scheduling method aiming 
at hard real-time systems has been proposed to operate in an embedded system 
with dynamic workloads. 


4.1 System Model and Problem Formulation 


In Figure 4.1 the consecutive stages of a task life cycle in the proposed 
system are presented. The task 7; is released at an arbitrary instant. Then an 
approximate schedulability analysis is performed, which can return either fail 
or pass. If the approximate test is passed, the exact schedulability, characterised 
with a relatively high computational complexity [42], is performed. If this test 
is also passed, the task is assigned to the appropriate core, selected during the 
schedulability tests, where it is executed before its deadline. 
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Figure 4.1 Building blocks of the proposed approach. 
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Figure 4.2 A proposed many-core system architecture. 


4.1.1 Application Model 


A taskset T is comprised of an arbitrary number of tasks, IT = {71, 72, 73, ...} 
with hard real-time constraints. The j-th job of task 7; is denoted with 7;,;. 
If a task is comprised of only one job, these terms are used interchangeably 
in this chapter. In case of tasks with job dependencies it is assumed that all 
jobs of a task are submitted at the same time, thus it is possible to identify the 
critical path at the instant of the task release. Periodic or sporadic tasks can be 
modelled with an infinite series of job. The taskset is not known in advance, 
thus the tasks can be released at any instant. 


4.1.2 Platform Model 


The general architecture of the proposed solution is depicted in Figure 5.3. The 
system is comprised of n cores, whose dynamic slacks (slack vector whose 
length |slack| = n) and busyness (vector U, |U| = n) are observed constantly 
by the Monitor block. 

In the Controllers block, one discrete-time PID controller for each 
core is invoked every dt time. The controllers use dynamic slacks of the 
corresponding cores as the observed values. 
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The Admission controller block receives a vector of controllers’ outputs, 
Y = [yi(t),.--,Yn(#)], from the Controllers block. Based on its elements” 
values it performs, as shown in Figure 4.1, (i) approximate schedulability 
analysis of atask admittance or rejection decision. If the decision is positive, an 
(ii) exact schedulability analysis is performed by the Design Space Exploration 
(DSE) block. If at the second stage the result of the task schedulability analysis 
is negative, the task is rejected. Otherwise it is (iii) allocated to a core where 
the execution before the deadline is guaranteed based on the schedulability 
analysis performed in block DSE. 


4.1.3 Problem Formulation 


Given an application and platform models, the problem is to quickly identify 
tasks whose hard timing constraints would be violated by the processing cores 
and then to reject such tasks without performing costy exact schedulability 
analysis. The number of rejected tasks should be reasonably close to the 
number of tasks rejected in a corresponding open-loop system, i.e., the system 
without the early rejection prediction. Meeting the deadlines for all admitted 
tasks shall be guaranteed. 


4.2 Performing Runtime Admission Control and Load 
Balancing to Cope with Dynamic Workloads 


In dynamic workloads, admitting and executing all hard real-time (HRT) tasks 
belonging to a taskset G can jeopardise system timeliness. The role of the 
admission control is to detect the potential deadline violation of a released 
task, 7;, and to reject it in such the case. Then the resource working time is not 
wasted for a task that would probably violate its deadline and early signaling 
of the rejection could be used for minimizing its negative impact. 

The j-th job of task 7;, Ti j, is released at r; j, with the deadline d; and 
the relative deadline D; ; = d; — ri j. The slack for 7; ¿ executed on core Ta, 
where 7, ;, was the immediate previous job executed by this core, is computed 
as follows: 


Fpk— Tij if Ipk + Cp, < Tij < Baby (4.1) 
A 


Cp — Cpe if rij < Tpk + Cp,ko 
Si j = 
if Tij > Fp,k, 


where r; ș is release time of 7; j, Ip — initiation time of 7, ; (also known as 
the job execution starting time), cp and Cp — computation time and worst- 
case execution time (WCET) of 7, k, and Fpy — its worst-case completion 
time. A similar slack calculation approach is employed in [162]. The three 
possible slack cases (Equation (4.1)) are illustrated in Figure 4.3 (top, centre, 
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Figure 4.3 Illustration of task 7;,; slack in three cases from Equation (4.1). 


bottom, respectively). In these figures the solid rectangle illustrates execution 
time (ET) of 7, ;,, whereas the striped rectangle shows the difference between 
WCET and ET of this task. 

The normalised value of slack of currently executed job T; on core Ta is 
computed as follows: 

Dij — Sij 
Dij 
This value is returned by a monitor and compared by a controller with setpoint 
slack_setpoint. An error ea(t) = slacka — slack_setpoint is computed for 
core Ta, as schematically shown in Figure 5.3. Then the a-th output of the 
Controllers block, reflecting the past and previous dynamic slack values in 

core Ta, is computed with formula 


slack, = (4.2) 


IW 


E e 
Yalt) = Kpea(t) + Kr Y ealt- i) + Kp a 
1=0 


(t) = ealt — 1) 
dt , 


(4.3) 
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where Kp, K¡ and Kp are positive constant components of the proportional, 
integral and derivative terms of a PID controller. Their values are usually 
determined using one of the well-known control theory methods, such as root 
locus technique, Ziegler-Nichols tuning method or many others, to obtain the 
desired control response and preserve the stability. In our research, we have 
applied Approximate M-constrained Integral Gain Optimisation (AMIGO), 
as it enables a reasonable compromise between load disturbance rejection and 
robustness [8]. This method has been outlined in Chapter 3. 

The value of slack_setpoint is bounded between values: min_slack_ 
setpoint and max_slack_setpoint, which should be chosen appropriately 
during simulation of a particular system. Similarly, the initial value of 
slack_setpoint can influence (slightly, according to our experiments) the final 
results. In this chapter, it is initialised with the average between its minimal 
and maximal allowed values to converge quickly with any value from the 
whole spectrum of possible controller responses. 

The slacks of the tasks executed by a particular processing core accumulate 
as long as the release times of each task are lower than the worst-time 
completion time of the previous task, which correspond to the first two cases 
in Equation (4.1) and are illustrated in Figure 4.3 (top and centre). It means 
that the slacks of subsequent tasks executed on a given core can be used as a 
controller input value. However, previous values of dynamic slack are of no 
importance when the core becomes idle, i.e., the core finishes execution of a 
task and there is no more tasks in the queue to be processed, which corresponds 
to the third case in Equation (4.1) illustrated in Figure 4.3 (bottom). To reflect 
this situation, the current value of slack_setpoint is provided as an error ea(t), 
to enhance the task assignment to this idle core (since it corresponds with the 
situation that the normalised slack would be twice as large as the current 
setpoint value, i.e., behaves in the way the task would finish its execution 
two times earlier than expected). Substituting this value not only positively 
estimates the task schedulability at the given time instant, but also influences 
future computation of the controller output, as it appears as a prior error value 
in the integral part in Equation (4.3). 

The Controllers block output value Y = [y1 (t),..., yn(t)] is provided as 
an input to the Admission controller block, where it is used to perform a task 
admittance decision. If all Controllers’ outputs (errors) ya(t),a € {1,...,n} 
are negative, the task 7; fetched from the Task queue is rejected. Otherwise, 
a further analysis is conducted by the Design Space Exploration (DSE) block 
to perform exact schedulability analysis. The available resources are there 
checked according to any exact schedulability test (e.g., from [42]), which is 
performed for each core with task 7; added to its taskset as long as a schedulable 
assignment is not found. In our implementation, this analysis has been carried 
out using the interval algebra described in Chapter 2. If no resource is found 
that guarantees the task execution before its deadline, it is rejected. 
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The pseudo-code of the control strategy is presented in Algorithm 3.9. 
This algorithm is comprised of two parts, described respectively by lines 
1-18 and 19-24, which are executed concurrently. The first part consists of 
the following steps. 


e Step 1. Invocation (lines 1, 17). 
The block functionality is executed in an infinite loop (line 1), activated 
every time interval dt (line 17). 


Algorithm 4.1 Pseudo-code of Admission controller involving DSE algorithm 


inputs : Task 7, € I (from Task queue) 
Vector of errors Y[1..n] (from Controller) 
Controller invocation period dt 
slack_setpoint decrease period dt,, dt, > dt 

outputs : Core Ta € Il executing 7 or job rejection 
Value of slack_setpoint 

constants: min_slack_setpoint - minimal allowed value of slack_setpoint 
max_slack_setpoint - maximal allowed value of slack_setpoint 
slack_setpoint_add - value to be added to slack_setpoint 
slack_setpoint_sub - value to be subtracted from slack_setpoint 


1 while true do 
2 while task queue is not empty do 
3 fetch 7); 
4 forall Ya > 0 do 
5 if taskset La U Tı is schedulable then 
6 assign T; to Ta; 
7 break; 
8 end 
9 if 7, not assigned then 
10 reject 7]; 
u if Y, : Ya > 0 A slack setpoint < max_slack_setpoint then 
12 | increase slack_setpoint by slack_setpoint_add; 
13 end 
14 end 
15 end 
16 end 
17 wait dt; 
18 end 
19 while true do 
20 if slack_setpoint > min_slack_setpoint then 
21 | decrease slack_setpoint by slack_setpoint_sub; 
22 end 
23 wait dt; 


24 end 
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e Step 2. Task fetching and schedulability analysis (lines 2-8). 

All tasks present in the Task queue are fetched sequentially (lines 2-3). 
For each task, the Controllers’ outputs are browsed to find positive values, 
which are treated as an early estimation of schedulability (line 4). If such 
value is found in an a-th output, an exact schedulability test checks the 
schedulability of the taskset TI, of the corresponding core ma extended 
with task 7; using any exact schedulability test (line 5), e.g., from [42]. 
If the analysis proves that the taskset is schedulable, 7, is assigned to Ta 
(line 6). Otherwise, the next core with the corresponding positive output 
value is looked for. 

Step 3. Task rejection and setpoint increase (lines 9-15). 

If all cores have been browsed and none of them can admit 7; due 
to either a negative controller output value or the exact schedulability 
test failure, the task 7, is rejected (line 10). In this case, if there 
exists at least one positive value in the Controllers’ output vector, the 
slack_setpoint is increased by constant slack_setpoint_add provided 
that it is lower than constant maz_slack_setpoint (lines 11-12) to 
improve the schedulability estimation in future. 


The second part consists of two steps. 


e Step 1. Invocation (lines 19, 23). 
The block functionality is executed in an infinite loop (line 19), activated 
every time interval dt,, dt, > dt (line 23). 

e Step 2. Setpoint decrease (lines 20, 21). 
The value of slack_setpointis decreased by constant slack_setpoint_sub 
(provided that it is higher than constant min_slack_setpoint), which 
encourages a higher number of tasks to be admitted in future. 


4.3 Experimental Results 


In order to check the efficiency of the proposed feedback-based admission 
control and real-time task allocation process, a simple Transaction-Level 
Modelling (TLM) simulation model has been developed in SystemC language. 
Firstly, the controller components Kp, Kr and Kp have to be tuned by 
analysing the corresponding open-loop system response to a bursty workload. 
Then three series of experiments have been performed. Firstly, a heavy 
periodic workload has been used to observe the behaviour of the overloaded 
system. Due to the regularity in the workload, some convergence of the setpoint 
has been expected. In the second series, workloads of various weight have 
been tested to observe the system behaviour under different conditions and 
to find the most beneficial operating region. Then industrial workloads with 
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dependent jobs have been used to determine the applicability of the proposed 
approach in real-life scenarios. 

To tune the parameters of the controller, the task slack growth response 
on a step-input in the open-loop system (i.e., without any feedback) has been 
analysed. This is a typical way in control-theory-based approaches [8]. As an 
input, a burst release of 500 tasks (with execution time equal to 50 us each) 
has been chosen. The modelled system has been comprised of 3 computing 
(homogeneous) cores. However, any number of tasks can be released, their 
execution time may vary and the number of cores can be higher, whichis shown 
in further experiments. The obtained results have confirmed the accumulating 
(integrating) nature of the process, and thus the accumulating process version 
of AMIGO tuning formulas have been applied to choose the proper values of 
PID controller components [8], similarly as it has been presented in Chapter 3. 
With a series of trial-and-error processes, the following constant values 
have been selected: min_slack_setpoint = 5, max_slack_setpoint = 95, 
slack_setpoint_add = 1, slack_setpoint_sub = 5, the first part of the 
proposed algorithm (Algorithm 3.9) is executed five times more often than the 
second one. 

During the first experiment, the system with the chosen parameters has 
been experimentally evaluated under a periodic workload, consisting of 900 
independent jobs (i.e., each task is comprised of a single job only), one released 
every 5 us, whose WCET equals to 50 us and the relative deadline is equal to 
60 us. These parameters have been chosen appropriately to make the taskset 
heavy enough to saturate the system. The exact schedulability test has been 
performed for each task passing the early estimation based on the controller’s 
output value. The systems with the number of processing cores ranging from 
1 to 11 have been considered. The real execution time (ET) of each task varies 
randomly between 60% and 100% of its WCET, which results in creation of a 
dynamic slack. In the schedulability analysis, since the already executed tasks 
may influence the execution of the task whose schedulability is being tested, 
it is less pessimistic but still safe to provide the ET of these tasks instead of 
their WCET. 

The regularity of the workload should cause convergence of the setpoint 
and decrease the variance of the normalised slack time. If the dynamic slack 
time normalised to task deadlines is close to 0%, it can be treated as an 
indication of well-chosen admission controller algorithm and the controller 
parameters, since it implies that the controller managed to minimize the 
steady-state error. The time needed to obtain this steady state indicates the 
responsiveness of the system, which should not be too long. 

To check the system response to tasksets of various heaviness, nine sets 
of 10 random workloads have been generated. Each workload is comprised 
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of 100 tasks, including a random number (between 1 and 20) of independent 
jobs. The execution time of every job is selected randomly between 1 and 
99 us. All jobs of a task are released at the same instant, and the release time 
of the subsequent task is selected randomly between r; + range min - Ci 
and r; + range_maz - Ci, where C; is the total worst-case computation time 
of the current tasks 7; released at r;, and range_min, range_maz € (0,1), 
range-min < range-maz. These values influence the workload heaviness 
which can be described with Param parameter, which we define as the total 
execution time of all jobs divided by the latest deadline of these jobs. For 
example, for pair range-min = 0.001, range-maz = 0.01, ten random 
workloads have been generated with Param ranging from 208.65 to 237.62, 
with the average value Param = 224.52. The average Param values for the 
generated workloads are given in Table 4.1. The value of | Param] can be 
viewed as a lower bound of the number of cores needed for computing all tasks 
in the workload before their deadlines. It is a rather optimistic value due to the 
bursty nature of the workloads and their deadlines. For example, only 71% of 
tasks are executed before their deadlines from a certain generated workload 
with Param = 4.58 in an open-loop 5-core system, whereas to execute all 
these tasks as many as 13 cores are needed. 


4.3.1 Number of Executed Tasks, Rejected Tasks and 
Schedulability Tests 


4.3.1.1 Periodic workload 

The number of tasks executed before their deadlines while using both ET and 
WCET for schedulability analysis has been compared with the corresponding 
open-loop system in Figure 4.4 (top). As expected, by using actual execution 
time (ET), the number of tasks executed before their deadlines is slightly 
increased. On average, an improvement of 3.7% is achieved. The results 
obtained by closed-loop approaches are clearly worse than those obtained 
with the open-loop approach, where schedulability of each task is analysed 


Table 4.1 Average Param values for random workloads generated with different 
range_min and range-max parameters 


rangemin  range-maz | Param 
0.001 0.01 224.52 
0.0025 0.025 77.07 
0.005 0.05 38.71 
0.0075 0.075 25.56 
0.01 0.1 18.90 
0.02 0.2 9.11 
0.03 0.3 6.12 
0.04 0.4 4.60 
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Figure 4.4 Number of executed tasks (top) and number of tasks rejected by the exact 
schedulability test (bottom) in closed-loop WCET, closed-loop ET and open-loop systems 
for the periodic task workload simulation scenario. 


with an exact schedulability test only. The open-loop approach admits about 
7.6% and 10.9% higher number of tasks than the closed-loop ET and WCET 
case, respectively. However, this improvement is achieved with a significant 
timing overhead. In an extreme case of one core system, 117 schedulability 
tests are to be conducted for the closed-loop ET case (and 233 for WCET) in 
comparison with 900 test executions in the open-loop system. It is important 
to note that only 21 exact schedulability tests for the ET case (and 143 for 
WCET) returned negative results, which demonstrates high accuracy of the 
proposed estimation scheme for open loop system. For higher number of 
cores, the differences are lower since each admitted task is to be checked 
by the exact schedulability test to ensure its hard deadlines compliance. On 
average, more than 38% and 34% of the schedulability tests can be omitted 
for the estimation based on ET and WCET, respectively. This difference is 
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caused by the number of tasks rejected by the exact schedulability test, which 
is illustrated in Figure 4.4 (bottom). 


4.3.1.2 Random workload 

Figures 4.5 and 4.6 present the number of tasks computed before their 
deadlines, rejected tasks and the number of the exact schedulability test 
executions with respect to the number of processing cores (ranging from 1 to 
9) and average values of Param, respectively, for open-loop and closed-loop 
(ET) systems. 

The numbers of executed tasks with respect to Param, both for the open- 
loop and closed-loop systems, are approximated better with power than linear 
regression (residual sum of squares is lower by one order of magnitude in case 
of power regression; logarithmic and exponential regression approximations 
were even more inaccurate). This regression model can be then used to 
determine the trend of executed task number with respect to different workload 
weights. Similarly, the difference between the number of admitted tasks by 
open and closed loop systems can be relatively accurate approximated with 
a power function (power regression result: y = 960.872 1:15, residual sum 
of squares rss = 3646.06). This relation implies that the closed-loop system 
admits relatively low number of tasks when the workload is light. In such 
lightweight condition, the number of schedulability tests to be performed 
is only 12% lower in the extreme case of the set with Param = 4.60. 
Thus, there is no reasonable benefits of using controllers and schedulability 
estimations. In heavier loaded systems, however, the number of admitted tasks 
in both configurations are more balanced, and the number of schedulability test 
executions is significantly varied. For example, for the two heaviest considered 
workload sets (i.e., with Param equal to 224.52 and 77.07) the schedulability 
tests are executed about 65% rarer in the closed-loop system. 

The number of executed tasks grows almost linearly with the number of 
processing cores in both configurations and the slopes of their linear regression 
approximations (both with correlation coefficients higher than 0.99) are almost 
equal. This implies that both configurations are scalable in a similar way and 
the difference between the number of executing tasks in open-loop and closed- 
loop systems is rather unvarying. The number of schedulability test executions 
is almost constant in the open-loop system regardless the number of cores. 
However, for the closed-loop configuration, it changes in a way relatively 
good approximated with a power regression model (power regression result: 
y = 1476.29270-, residual sum of squares rss = 14216.21). Since the 
growing number of processing cores corresponds to less computation on each 
of them, the conclusion is similar as in the Param variation case: the higher 
the load for the cores, the more beneficial is applying of the proposed scheme. 

The number of tasks rejected in the open-loop systems using the exact 
schedulability test is considerably higher for heavier random workloads and 
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Figure 4.5 Number of tasks executed before their deadlines (top), the number of rejected 
tasks (centre) and number of the exact schedulability test executions (bottom) in baseline open- 
loop and proposed closed-loop ET systems for the random workloads simulation scenario with 
different weight of workloads. 


lower number of cores, whereas for lighter random workloads or higher 
number of cores it is similar to the closed-loop ET system. For the closed- 
loop ET systems these figures illustrate the number of false positive errors of 
the approximate schedulability analysis, whereas for the open-loop systems it 
complements the number of tasks executed before deadlines. 
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Figure 4.6 Number of tasks executed before their deadlines (top), the number of rejected 
tasks (centre) and number of the exact schedulability test executions (bottom) in baseline open- 
loop and proposed closed-loop ET systems with different number of processing cores for the 
random workloads simulation scenario. 
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4.3.2 Dynamic Slack, Setpoint and Controller Output 


4.3.2.1 Periodic workload 

Figures 4.7 shows dynamic slack, setpoint and controller outputs during the 
simulation for the periodic task workload executed on a system with 3 cores. 
In Figure 4.7 (top), the initial relative large values of the slack of three cores 
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Figure 4.7 Dynamic slack (top), setpoint (centre) and controller output (bottom) during the 
simulation for the periodic task workload simulation scenario executed by a 3 core system. 


(plotted with three different colors), normalised to task deadlines, equal to 
46%, 42.8%, and 39.9% (the difference is caused with the random ET). The 
slack values decrease fast and after 530 ms none of them is higher than 5% of 
the task deadlines. This implies that tasks are executed relatively close to their 
deadlines, but never miss them. This behavior is obtained due to the exact 
schedulability test performed in the Design Space Exploration block and also 
indicates minimization of the steady-state errors by controllers. Taking into 
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account the initial high values of the setpoint, the time of reaching the low 
normalised slack time can be treated as rather short. However, in random and 
industrial workloads reaching any steady state is very rare due to the lack of 
strict periodic behaviour of the workloads, as shown later in this chapter. 

The early estimation based on controller outputs does not admit too many 
unschedulable tasks (in this experiment only 19 such tasks have been detected 
by the schedulability test). It is visible in Figure 4.7 (centre), where the 
value of setpoint decreases from the initial value to the minimum (by the 
functionality of the 2nd part of the algorithm presented in Algorithm 3.9), and 
after 650 ms no increase is observed. It means that after this time not a single 
unschedulable task has been wrongly identified as schedulable by the early 
estimation. The initial high values of normalised slack and setpoint are also 
reflected in Controllers’ output values (Figure 4.7 (bottom)). Every time the 
value of an appropriate controller output is negative, a released task cannot 
be executed on the corresponding processing core. Despite only a sign of the 
controller output is important for the task admittance, relative large values of 
the controller outputs denote significant variance over observed normalised 
slack, which may be caused with not yet stabilised value of the controller 
setpoint. After about 750 ms the absolute value of the controller outputs are 
rather low, which means that the task slacks observed in the corresponding 
cores are low and the workload is rather predictable as compared to the random 
workload (next experiment). 


4.3.2.2 Light workload 

In Figure 4.8, the observed run-time metrics of the closed-loop 3-core system 
simulation of one selected light workload (with Param = 7.89), taken 
from [23], is presented. From this particular workload, 53 tasks have been 
executed, 569 tasks rejected by the early estimation, and 293 tasks rejected 
by the exact schedulability test. In comparison with the periodic workload 
run-time characteristics, presented in Figure 4.7, more false positive early 
estimations can be observed, which is reflected in higher values in the curve 
depicting the setpoint value (Figure 4.8 (centre)). Since the execution time of 
the tasks assigned to the cores vary significantly (from 1 ms to 95 ms), the 
normalised slack times and consequently controller outputs differ considerably 
for each system core (Figure 4.8 (top, bottom)), but overall decrease trend in 
the normalised slack time can be noticed. 


4.3.3 Core Utilization 


For the periodic workload, Figure 4.9 (top) presents the total utilisation of the 
three cores (100% core utilisation means that all cores are busy at a particular 
instant). Except for the system initialisation, there is no situation that all three 
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Figure 4.8 Dynamic slack (top), setpoint (centre) and controller output (bottom) during the 
simulation for the selected light workload simulation scenario executed by a 3 core system. 


cores are idle. On average, the core utilisation for this simulation is equal to 
83%. All three cores are balanced as the difference in their utilisations does 
not exceed more than 2 per mile. Similar utilisation and balance have been 
observed for other system configurations. 

For the light workload, used also as an example in the previous subsection, 
arelatively long idle period of all cores can be observed (Figure 4.9 (bottom)). 
It is caused with the lack of task release between 10 ms and 490 ms in this 
particular workload. Except for this interval, it is rather difficult to observe any 
controller steady state, which is due to the changeable nature of the workload. 
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Figure 4.9 Core utilisation during the simulation for the periodic (top) and light (bottom) 
workload simulation scenario with 3 core system. 


4.3.4 Case Study: Industrial Workload Having Dependent Jobs 


To analyse industrial workloads, 90 workloads have been generated based 
on the grid workload of an engineering design department of a large aircraft 
manufacturer, as described in [23]. These workloads include 100 tasks of 827 
to 962 jobs in total. The job execution time varies from 1 ms to 99 ms. Since the 
original workloads have no deadlines provided explicitly, relative deadline of 
each task has been set to its WCET increased by a certain constant (100 ms). 

In these workloads all jobs of any task are submitted at the same time, 
thus it is possible at the first stage to identify the critical path of each task 
and admit the task if there exists a core that is capable of executing the jobs 
belonging to the critical path before their deadlines. At the second stage, the 
remaining jobs of the task can be assigned to other cores so that the deadline 
of the critical path is not violated. The outputs from controllers can be used 
for choosing the core for the critical path jobs (during the first stage) or the 
cores for the remaining jobs (during the second stage). Four configurations, 
summarised in Table 4.2, can be then applied. 
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Table 4.2 Four configuration possibilities with respect to controllers’ output usage (OL and 
CL stands for open-loop and closed-loop, respectively) 


Core Selection for Core Selection for Tasks | Configuration 
Critical Path Tasks Outside the Critical Path | Abbreviation 


without controllers without controllers OLOL (baseline) 
without controllers with controllers OLCL 
with controllers without controllers CLOL 
with controllers with controllers CLCL 


Figure 4.10 (top) shows the number of jobs executed before their dead- 
lines. The OLOL configuration can be treated as the baseline, since no control 
theory elements have been applied (only exact schedulability tests are used to 
select a core for a job execution). The cores are scanned in a lexicographical 
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Figure 4.10 Number of executed jobs (top) and number of schedulability test executions 
(bottom) for systems configured in four different ways for the industrial workloads simulation 
scenario. 
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order as long as the first one capable of executing the job satisfying its timing 
constraints is not found, whereas in the closed-loop configurations the tasks are 
checked with regards to the decreasing value of the corresponding controller 
outputs. 

The OLOL configuration approach seems to be particularly beneficial in 
the systems with lower number of cores (heavier loaded with tasks). However, 
in the systems with more than two cores, the OLCL configuration leads to the 
best results. Its superiority in comparison with CLCL stems from the fact that 
an over-pessimistic rejection of critical path jobs leads to fast rejection of the 
whole task. Thus the cost of a false negative estimation is rather high. Wrong 
estimation at the second stage usually results in choosing an idler core. The 
OLCL configuration admits 11% more jobs than OLOL, whereas CLCL is 
only slightly (about 1.5%) better than the baseline OLOL. 

The main reason for introducing the control-theory based admittance is, 
however, decreasing the number of costly exact schedulability testing. The 
number of the exact test executions is presented in Figure 4.10 (bottom). Not 
surprisingly, the wider the usage of controller outputs, the lower is the cost 
of schedulability testing. The difference between OLOL and OLCL is almost 
unnoticeable, but the configurations with control-theory-aided selection of a 
core for the critical path jobs leads to significant, over 30% reduction. 

From the results it follows that two configurations OLCL and CLCL 
dominate the others: the former in terms of number of executed jobs, the latter 
in terms of number of schedulability tests. Depending upon which goal is more 
important, one of them is advised to be selected. Interestingly, only in case 
of low number of processing cores, the baseline OLOL approach is slightly 
better than the remaining ones. For larger systems, applying PID controllers 
for task admissions seems to be quite beneficial. 


4.4 Related Work 


A majority of works that apply techniques originated from control-theory 
to map tasks to cores offers soft real-time guarantees only, which cannot 
be applied to time-critical systems [82]. Relatively little work is related 
to the hard real-time systems, where the task dispatching should ensure 
admission control and guaranteed resource provisions, 1.e., start a task’s job, 
only when the system can allocate a necessary resource budget to meet its 
timing requirements and guarantee that no access of a job being executed to 
its allocated resources is denied or blocked by any other jobs [95]. Providing 
such kind of guarantee facilitates to fulfill the requirements of time critical 
systems, e.g., avionic and automotive systems, where timing constraints must 
be satisfied [56, 75]. 
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Usually hard real-time scheduling requires a priori knowledge of the worst- 
case execution time (WCET) of each task to guarantee the schedulability of the 
whole system [42]. However, according to a number of experimental results 
[48], the difference between WCET and observed execution time (ET) can 
be rather substantial. Consequently, underutilization of resources can often be 
observed during hard real-time system run-time. The emerging dynamic slack 
can be used for various purposes, including energy conservation by means of 
dynamic voltage and frequency scaling (DVFS) or switching off the unused 
cores with clock or power gating and slack reclamation protocol [140]. 

In [162], the authors claim that numerous existing hard real-time schemes 
are not capable of adapting to dynamically changing workloads in a satisfac- 
tory manner and thus do not scale well in the average case, whereas substantial 
energy dissipation savings are possible. An idea of splitting each task’s WCET 
into two intervals is presented, where the length of the first interval is equal to 
the predicted execution time and the remaining part is the second interval. The 
entire dynamic slack, accumulated from previously executed tasks, is meant 
to be consumed during the first interval, by executing the task with lower 
voltage and frequency and, consequently, lower performance. The length 
of this interval is determined with a proportional-integral-derivative (PID) 
controller. Similar approaches have been applied in [3] and [140]. 

In [45], a response time analysis (RTA) has been used to check the 
schedulability of real-time tasksets. This ensures meeting all hard deadlines 
despite assigning various execution frequencies to all real-time tasks to 
minimise energy consumption. In the approach proposed in this chapter, RTA is 
also performed, but it is executed far less frequent due to the fast schedulability 
estimation based on controllers and thus is characterised with shorter total 
execution time. 

Some researchers highlight the role of a real-time manager (RTM) in 
scheduling hard real-time systems. In [74], an RTM is used together with 
computing resources monitoring, while schedule information are precomputed 
from an SDF graph statically to help guaranteeing the real-time constraints. 
We have extended basic ideas from their scheme to work with dynamic 
workloads using information gathered by the monitor. The role of RTM is 
also highlighted in [72]. They described that after receiving a new allocation 
request, it checks the resource availability using a simple predictor. Then 
the manager periodically monitors the progress of all running tasks and 
allocates more resources to the tasks with endangered deadlines. However, 
it is rather difficult to guarantee hard real-time requirements when no proper 
schedulability test is applied. In [131], an RTM exploits information about the 
probability of task execution time to predict the slack available for power 
management. It is assumed that the execution time of a task in terms of 
its worst-case execution time is given by a known cumulative distribution 


72 Feedback-Based Allocation and Optimisation Heuristics 


function. The stochastic nature of this approach prevents it from application 
in hard real-time systems if even tiny probability (e.g., 10-12 [16]) of missing 
any deadline is not allowed. 

From the literature survey it follows that applying feedback-based con- 
trollers in hard real-time systems has been limited to determining the 
appropriate frequency benefiting from DVFS. According to the authors’ 
knowledge, the feedback controller has not been yet used by an RTM to 
perform hard real-time task allocation under dynamic workload on many-core 
systems. 


4.5 Summary 


In this chapter we presented a novel scheme for dynamic workload task 
allocation to many-cores using a control-theory-based approach. Unlike the 
majority of similar existing approaches, we deal with workloads having hard 
real-time constraints that is desired in time critical systems. Thus, we are 
forced to perform exact schedulability tests, whereas PID controllers are used 
for early estimation of schedulability. 

We achieved an improved performance due to reduced number of costly 
scheduling test executions, slightly limiting the number of admitted tasks in 
the majority of cases. The controllers observe dynamic slack of executed tasks 
and aim to select the core with the lowest load. 

For heavy workloads the proposed scheme achieves a better performance 
than using the typical open-loop approach. Up to 65% lower number of 
schedulability tests are to be performed, whereas the number of admitted 
tasks is almost equal for the heavy-loaded system and lower up to 19% with 
lightweight scenarios, for which the proposed scheme is less appropriate. For 
industrial workloads with dependent jobs executed on larger systems, the 
number of executed tasks using the proposed approach was even higher than 
the open-loop baseline system due to the selection of more idle cores for 
computing jobs belonging to the critical path. 

Since schedulability analysis requires relatively long computation time, 
decreasing the number of its executions should lead to considerable com- 
putation time and energy reduction. The exact gain depends on a particular 
system configuration and will be evaluated in our future work. We also plan to 
consider heterogeneous many-core system and extend the proposed approach 
for mixed criticality workloads. 


5 


Search-Based Heuristics for Modal 
Application 


Due to the growing number of electronic control units (ECUs) in contemporary 
cars, sometimes reaching even 100, the automotive industry gradually resigns 
from their paradigm of using a separate unit for each functionality [99]. The 
requirement of placing a number of ever more sophisticated functionalities in 
one chip resulted in appearance of multi-core ECUs [158]. The AUTOSAR 
(AUTomotive Open System ARchitecture) standard [1] assumes a static (1.e., 
compile-time) mapping of atomic software components, named runnables, 
into cores since it is less complex and more predictable than dynamic resource 
allocation [94]. 

Due to the hard real-time constraint in automotive systems, the cores have 
to execute all the tasks on time even for their worst-case execution behavior, 
where they take worst-case execution time (WCET), which is usually much 
higher than the average execution time [152]. One possibility of decreasing 
the difference between the worst and average task execution times stems 
from the modal nature of such applications, i.e., from the fact that they 
can behave in a limited, known at design-time, number of ways, named 
modes. If each mode is analysed independently, the average execution time 
may be closer to the WCET determined for that mode [118]. In [104], six 
modes have been identified in a 4 cylinder gasoline torque based system, for 
example Cranking, Idle and Wide Open Throttle. It has been stressed there 
that execution times of particular runnables differ significantly for various 
modes of an ECU and thus applying different mappings for each operating 
mode may be beneficial. This way a lower number of cores could be needed 
than that of the corresponding system design not considering operating modes. 
However, introducing different mapping for modes imposes significant design 
complications, which have not been analysed in [104]. 

The contexts of runnables that are executed on different cores in different 
modes have to be migrated from one core to another, setting additional 
requirements for the available communication bandwidth. The process of 
mode switching usually incurs overhead (both in execution time and energy), 


73 


74  Search-Based Heuristics for Modal Application 


which is to be taken into account at run-time to decide whether to switch 
to a different mode or not. In hard real-time systems, it is essential to satisfy 
all the timing constraints even during the mode switching process, i.e., the 
migration time of tasks must be time bounded [146]. Therefore, the worst 
case switching time has to be assumed to provide the timing guarantees. 

During the migration process, the taskset schedulability must not be 
violated. To guarantee this property, we propose to treat a migration process as 
any other asynchronous process in schedulability analysis, i.e., to use so-called 
periodic servers, which are periodic tasks executing aperiodic jobs. When a 
periodic server is executed, it processes pending task migration. If there is 
no pending migration, the server simply holds its capacity. To reduce the 
migration time, a recursive greedy algorithm for reducing the amount of data 
transferred during a mode change is proposed. It aims to decrease the number 
of periodic server instances used during a single mode switching. The proposed 
approach can be applied to any hard real-time systems, where different 
operating modes can be identified, and automotive systems in particular. 

As an example, throughout this chapter we will analyse an engine ECU 
code named DemoCar. We will identify its operating modes and apply 
clustering to decrease their number and to eliminate task migration between 
neighbouring mode pairs (i.e., two modes from which at least one mode 
can be directly transferred to the second one) if the mode change is to be 
finished rapidly. The mappings for each mode will be determined using a 
genetic algorithm. This algorithm applies two optimizing criteria: runnable 
schedulability in terms of a number of deadline violations and migration 
cost in terms of the context length of the transmitted runnables. The typical 
schedulability analysis is used to determine the necessary network bandwidth 
to guarantee that the mode switching migration finishes in the required time. 

In the next section, the state-of-the-art solutions are described followed by 
the proposed approach and a discussion on providing performance guarantees 
during mode changes. 


5.1 System Model and Problem Formulation 
5.1.1 Application Model 


In this work we assume application model is consistent with the AUTOSAR 
standard [1]. A taskset T is comprised of an arbitrary number of periodic run- 
nables, I = {71, 72, 73,...}, grouped in tasks with hard real-time constraints. 
The j-th occurrence (j-th job) of runnable 7; is denoted with 7; j. The taskset 
is known in advance, including the WCET of each runnable, C;, its period 
T;, priority P; and its relative deadline D; equal to this period. Runnables 
are atomic schedulable units communicating each other with so called labels, 


5.1 System Model and Problem Formulation 75 


N = {1,...,V,~}, which are memory locations of a particular length. The 
order of read and write operations to labels denotes the runnable dependencies, 
as the write operation to a particular label should be completed before its 
reading. Deadlines for mode changing time between each neighbouring pair of 
modes are also provided. We assume that the labels are stored in the same node 
that the runnable that reads these labels. If more than one runnable mapped 
to different cores read from the same label, its content is to be replicated to 
all the reading nodes and the writer should update the label value at all the 
locations. It means that the writer is aware of all its readers and knows their 
locations in all the possible modes. 


Example 1 Throughout this chapter, we consider a lightweight engine control 
system named DemoCar as an example application. The flow graph of this 
application is depicted in Figure 5.1. It consists of 18 runnables and 61 
labels. All runnables are periodic: 8 runnables (highlighted in green) are 
to be executed every 10 ms, whereas period of 6 runnables (red, blue and 
yellow) equals 5 ms, two (violet) runnables are executed every 20 ms and 
the period of two (orange) runnables is 100 ms. In Figure 5.2 (upper part), 
11 identified modes of this application are presented. These modes have been 
identified by inspecting the code of the runnable named OperatingModeS WC, 
which computes values of transaction and output functions of the Finite State 
Machine steering this engine. For example, label FuelEnabled is read by two 
runnables: TransFuelMassS WC and ThrottleChangeS WC. If these runnables 
are mapped to different cores, the label is to be replicated and kept in both 
the cores where these runnables were mapped to. It is a role of the writer, 
OperatingModeSWC, to update these values coherently not violating any 
timing constraints. 


5.1.2 Platform Model 


The hardware platform assumed in this chapter is a mesh Network on Chip 
(NoC) with a certain number of cores m € Il and routers Y € Y, as shown 
in Figure 5.3. Each link is modelled as a single resource, so, for example, 
to transfer a portion of data from 79,; to appropriate sink 72 y we need such 
resources allocated simultaneously: 70,1 — Vo; Yo, — Yi. Via — Va 
21 — ao, Yao — 72,0. In every mode, each runnable is mapped to one 
core and a label is stored in local memories of the cores requesting that label. 
Data transfer overhead is taken into consideration, assuming constant time for 
transferring a single flit (Flow control digIT, a piece of a network package 
whose length usually equals the data width of a single link) between two 
neighbouring cores if no contentions are present. Timing constants for packet 
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Figure 5.2 Finite State Machine describing mode changes in DemoCar use case: before 
(upper part) and after (lower part) the clustering step. 
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Figure 5.3 An example many-core system platform. 


latencies while traversing one router and one link are denoted as dp and dz, 
respectively. The priority of data transfer packets are assumed to be equal to 
the priority of the runnable sending them. 


5.1.3 Problem Formulation 


Given a platform and an application model with a defined set of operating 
modes, the problem is to determine schedulable mappings for each mode so 
that the amount of data to be migrated during the allowed mode changes is 
minimized. During mode changing, the taskset should be still schedulable 
despite the additional network traffic generated by the task migrations. The 
neighbouring modes (i.e., the modes connected with a link in the FSM 
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describing the allowed mode transitions) with similar runnables’ execution 
time can be clustered to decrease the frequency of task migrations. The 
deadlines for mode changing time between each neighbouring pair of modes 
must not be violated. 


5.2 Proposed Approach 


In this section, steps of the proposed design flow are described. Since it has 
been assumed that the tasksets of the considered application are known in 
advance, it is possible to perform some preliminary computations statically. 
Consequently, the mapping problem can be split into two stages: off-line 
(static) and on-line (dynamic), as shown in Figure 5.4. The computation time 
of the off-line part is not crucial and thus heuristics with even high complexity 
may be used for runnable and label mappings. It seems promising to combine 
the most effective approaches, such as multi-objective simulating annealing or 
genetic algorithms. The possibility of extending genetic operators benefiting 
from the full knowledge of the system domain, such as mutation in a way 
similar to [109], makes the genetic approach the first choice at this step. 

During run time, detection of the current mode is assumed to be done 
by observing certain variable. (In DemoCar such variable is named —sm and 
is stored in runnable OperatingModeSWC.) When a value of this variable 
has been changed, the current runnable and label mapping might have to 
be switched. The mappings have been chosen during the design time with 
respect to minimize the amount of data to be migrated. Schedulability analysis 
guarantees that even the worst case switching time does not violate the deadline 
required for mode changes. If such violation is unavoidable, either the states 
can be clustered, or the network bandwidth is to be increased. 

The off-line part of the proposed approach is comprised of five steps, 
which are covered in the following subsections. 


5.2.1 Mode Detection/Clustering 


The reasons for introducing the mode detection & clustering step are twofold. 
Firstly, some neighbouring modes can be characterized with similar runtime 
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Figure 5.4 Steps of dynamic resource allocation method benefiting from modal nature of 
applications. 
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and resource consumption. Then there is little benefit in preparing different 
mappings for such modes and migrating the runnables when a transition 
between these neighbouring modes is made. Moreover, some transitions are 
required to be done immediately, whereas others can be less time tight. 
If a runnable migration is to be performed quickly, for example between 
two consecutive runnable occurrences, the bandwidth needed to transfer the 
appropriate amount of data in that time may be unreasonably high. Therefore, 
it may be more sensible to merge two modes with such rapid task switching 
time and generate only one mapping for them. 


Example 2 (continuation of Example 1) In DemoCar, transitions between 
modes: Stalled, Cranking, IdleOpenLoop, IdleClosedLoop, Drive, Wait- 
ForOverrun, Overrun, WaitForAFRFeedback and WaitForPowerDownDelay 
are to be performed between two consecutive executions of their runnable 
occurrences, which is upperbounded with 5 ms for 9 runnables. Since perform- 
ing task migration during such short time window would require a bandwidth 
of considerable size, these modes have been clustered into Cluster_1. Finally, 
three modes can be identified after the clustering step: PowerDown, PowerUp 
and Cluster_1, as presented in Figure 5.2 (lower part). 


5.2.2 Spanning Tree Construction 


To minimize the amount of data to be migrated between two consecutive 
modes with the technique proposed in this chapter, the FSM describing mode 
changes should include weights denoting state transition probabilities. Since 
probabilities of staying in the current mode are not relevant at this step, they 
can be omitted for simplicity. The probabilities can be given or determined 
during long simulation of the modal system. The FSM has also to have all 
its cycles removed to guarantee halting of the Static mapping for non-initial 
modes step. In this regard, for an FSM treated as a weighted connected graph 
G(V, E), where V is the set of vertices and E denotes the set of edges, a 
maximum spanning tree can be constructed. We recollect that a spanning tree 
of a graph G is its subgraph T(V, E”), which is connected and whose number 
of edges is equal to the number of vertices minus 1, | E”| = |V|—1.If 7 denotes 
the set of all spanning trees of G, a maximum spanning tree Tinaz(V, Emax) 
of G is a spanning tree iff: 


> 
ber De vedz Y) woz), 
(v,z)€ Emax (v,z)EE" 


where w(v, 2) is the weight value assigned to the edge from a vertex v to z. 
A maximum spanning tree can be constructed in time O(|E|log|V]), e.g., by 
the classic Prim’s algorithm [108]. 
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The operation performed in this step neither influences the application 
behaviour nor limit the possible mode transitions. It only makes the least 
frequent transitions not optimized during stage Static mapping for not initial 
modes minimizing amount of data to be migrated (Figure 5.4). 


Example 3 (continuation of Example 2) For DemoCar, probabilities of mode 
changing have been shown in Figure 5.5 (left). The maximum spanning tree, 
constructed with the Prim’s algorithm, is presented in Figure 5.5 (right). 


5.2.3 Static Mapping for Initial Mode 


Since mapping for each mode is performed off-line, even heuristics known 
from their high computational cost, such as genetic algorithms, can be applied. 
A genetic algorithm used for hard real-time systems shall guarantee that under 
the chosen schedule all timing constraints are satisfied. This can be performed 
in several ways. For example, each missed deadline can impose a certain 
penalty to the fitness function value, and thus each schedule with unsatisfied 
constraints should be eliminated during the evolutionary process. A particular 
mapping is portrayed as a chromosome, stored as a bit string, representing on 
each gene the processing core where the task would be mapped to, similarly 
to [61]. The bit string one-point crossover operator and flip bit mutation have 
been applied together with the tournament selection of the individuals. 

Below, an algorithm encompassing the aforementioned properties is 
described. 

In Algorithm 5.1, it is presented a pseudo-code of a genetic algorithm that 
can be used during Static mapping for initial mode step, the third off-line step 
of the proposed approach, as depicted schematically in Figure 5.4. We propose 
to use two fitness functions — measuring (i) the number of deadline violations 
and (ii) makespan (also known as response time). Both these functions apply 
the interval algebra described in Chapter 2. The first fitness function value is 
of primary importance, as in a hard real-time system no deadline violation 
is allowed. But among fully schedulable mappings, the one leading to a 
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Figure 5.5 Spanning tree construction for DemoCar. 
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Algorithm 5.1 Pseudo-code of no deadline violation with makespan 
minimisation algorithm for the initial mode mapping 


inputs : Workload I”; 
Resource set IT; 
outputs : Task mapping; 


1 Choose an initial random population of task mappings 

2 while not termination condition do 

3 Evaluate the number of deadline violations using IA; //criterion (i) 

4 Evaluate the makespan using IA; //criterion (ii) 

5 Create clusters of individuals with the same number of deadline violations; 

6 Sort the clusters by increasing number of deadline violations; 

7 Sort individuals in each cluster wrt their makespan; 

8 Perform tournament selection; //criterion (i) has higher priority than criterion 


(ii) 


9 Generate individuals using crossover and mutation; 
10 Create a new population with the best found mappings; 
u end 


lower makespan is chosen, since idle intervals can be used to decrease energy 
consumption or execute tasks of lower criticality levels. 

In the algorithm, the following steps can be singled out. 

Step 1. Initial population initialisation (line 1). An arbitrary number of 
random task mappings (individuals) is created. 

Step 2. Creating a new population (lines 3-10). For each individual, values 
of two fitness functions are computed - the number of deadline violations and 
the makespan (lines 3-4). Individuals with the same number of deadline misses 
are grouped together (line 5). The groups are then sorted with respect to the 
number of deadline violations in the ascending order (line 6). Inside each 
group, individuals are sorted according to their growing makespan (line 7). 
The tournament selection is then performed — individuals from a group with 
lower number of deadline violations are always preferred, whereas among 
individuals from one group the one with the lowest makespan is to be chosen 
(line 8). The individuals winning the tournament are then combined using a 
typical crossover operation and mutated (line 9). A new population is created 
(line 10). Step 2 is repeated in a loop as long as a termination condition is not 
fulfilled, which can be a maximal number of generated populations or lack of 
improvement in a number of subsequent generations. 


Example 4 (continuation of Example 3) For the PowerUp (initial) mode of 
DemoCar to be executed on a multi-core embedded system, we evaluate 
makespan and number of violated deadlines during one hyperperiod (i.e., 
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the least common multiple of all runnables’ periods) by allocating runnables 
and labels to different cores. 

The size of the NoC mesh has been initially configured as 2x2 with no idle 
cores, since this size has been earlier checked (also using Algorithm 5.1) and 
is large enough to execute DemoCar in the most computational intensive 
mode, Cluster_1, not violating any of its timing constraints. The genetic 
algorithm is executed again to perform assignment of tasks to cores with 
timing characteristics for the initial PowerUp mode. The genetic algorithm 
has been configured to generate 100 generations of 20 individuals each. The 
first fully schedulable allocation has been found in the Ist generation, which 
suggests that it might be possible to allocate the taskset to a lower number of 
cores. 

After performing further search it has appeared that the taskset in the 
initial mode is schedulable even when mapped to one (out of four) active core. 
The lowest makespan for the NoC with three idle cores is equal to 8622 ys. 


5.2.4 Static Mapping for Non-Initial Modes 


It is of primary importance to migrate as little data as possible during mode 
changes to minimise the migration time. 

Each application A includes a set of tasks and can be represented with 
a vector comprised of p runnables and r shared memory locations (labels) 
of these tasks, A = [71,...,Tp,V1,-..,1Vy], and platform II is composed 
of s processing cores, II = {71,...,7,}. A mapping M is a vector of p 
core locations, M = [7,,,...,77,], Where each element corresponds with the 
appropriate element of A and can be substituted with any element of set II. 
Each element of weight vector W, W = [w;,,..., Wr, ], is equal to the amount 
of data that has to be transferred when a particular runnable is migrated, 
including the labels to be read. 

Let Ma and Mg be sets of mappings that are fully schedulable in a 
given system in state a and (3, respectively. The elements of the difference 
vector Dma,mg = [dz . - - , dr, ] indicate which runnables are to be migrated 
when the mode is changed from a to 8. Each element ds, Ô € [71,...,Tp), 
takes value 1 if the particular runnable/label is allocated to different cores in 
mappings ma € Ma and mg € Mg, and 0 otherwise: 


0, if mas = mas, 
= y i .1 
de a otherwise oe 


where ma and mg, denote the d-th element of vectors ma and mg, 
respectively. The migration cost c between two states a and ĝ is then computed 
in the following way: 
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rene = Dma,mg ` W7. (5.2) 


A recursive greedy algorithm for reducing an amount of data transferred during 
mode changes is presented in Algorithm 5.2. 

Since some cycles are likely to occur in a graph representing the Finite 
State Machine describing transitions between modes, a spanning tree (ST) is to 
be built, as described in the previous subsection. Then the mode corresponding 
to the initial state of the FSM is selected as the current mode (line 1). For this 
mode, a set of schedulable mappings is generated, e.g., with Algorithm 5.2 
(line 2). If more than one schedulable mapping is found, an additional criterion 
crit (e.g., minimum makespan value) is used to select one of them (line 3). 
Then for each direct successor of the ST node corresponding to FSM initial 
state, the FindMappingMin procedure is executed (lines 4 and 5). 

In the FindMappingMin procedure, a set of schedulable mappings for that 
successor node is found using minimal migration cost criterion (5.2) (line 8). 
If more than one schedulable mapping is equally evaluated by this criterion, 
an additional criterion, crit, is used (line 9). The FindMappingMin procedure 
is then recursively run for each direct successor of the ST node provided as the 
function parameter (lines 10 and 11). More mappings could be delivered to the 
FindMappingMin procedure to browse a larger search space by skipping lines 
4 and 9 in the algorithm and providing all elements of Ma instead of just one. 


Algorithm 5.2 Pseudo-code of a migration data transfer minimisation 
algorithm 


inputs : A spanning tree ST based on Finite State Machine (FSM) describing the 
system modes with transaction probabilities; 
W - size of each runnable memory footprint; 
crit - mapping optimality criterion (e.g., minimum makespan value); 
outputs : Runnable and label mapping for each mode; 


Select the initial state of ST and assign it to a; 

Find a set of schedulable mappings Ma; 

Select ma € Ma wrt criterion crit; 

forall 8 being a direct successor of a in ST do 
| FindMappingMin(a, 3, ma); 

end 


aA un h Y nn mm 


7 FindMappingMin(a, 3, ma) 
s Find a set of schedulable mappings Mg minimizing criterion (5.2) using W 
9 Select mg € Mg wrt criterion crit 
10 forall q being a direct successor of 3 in ST do 
u | FindMappingMin(f, q, mg) 
12 end 
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Example 5 (continuation of Example 4) Regardless of the mode, the appli- 
cation has been mapped in a 2 x 2 mesh Network on Chip without deadline 
violations. For the PowerUp mode, schedulable mappings have been found 
even if three of the four NoC cores remains idle, as shown in Example 4. It 
means that in this mode three cores can be switched off, leading to considerable 
energy savings. Similarly, two cores can remain idle in the PowerDown 
mode. (PowerDown requires more computations than PowerUp since some 
maintenance procedures are to be consistently performed.) However, despite 
intensive search using a genetic algorithm, all four cores are needed in the 
Cluster_1 mode to have the taskset fully schedulable. Thus, when the current 
mode changes from PowerUp to Cluster_1, three cores have to be activated, 
whereas two cores can be switched off after leaving the Cluster_1 mode. 

Let us focus on the transition between the PowerUp and Cluster_1 modes. 
For PowerUp, only one core is active and thus all runnables are to be mapped 
to the only active core. However, in other cases a larger set of mappings that 
are fully schedulable on active NoC cores would have been identified. From 
these mappings, the one with the lowest makespan (an additional criterion) 
is chosen. This mapping has been used as a parameter of the FindMapping- 
Min procedure (from the algorithm presented in Algorithm 5.2). The set of 
schedulable mappings following the minimum criterion (Equation (5.2)) is 
identified using a genetic algorithm. By the applied criterion (Min[Cma m a) J 
a significantly lower amount of data has to be migrated during the mode 
change. In the best found case, 3 runnables have to be migrated, whose total 
CPowerUp,Cluster.1 = 261968 bytes. However, by not using this criterion, but 
the minimal makespan instead, the lowest number of runnables to be migrated 
equals 13, which results in CPowerUpz,Cluster 1 = 890162 bytes. In the second 
case, the amount of data to be transferred using a periodic server is about 
240% higher than in the first mapping pair. Since periodic servers offer equal 
throughput during the system execution, the mode change between the latter 
mappings would last more than three times as long as between the former pair. 

During mode change from Cluster_1 to PowerDown, 2 runnables have to 
be migrated and cCiuster 1, Power Down = 113568 bytes. Although the transi- 
tion between modes PowerDown and PowerUp are not optimized, in this case 
only 2 runnables have to be migrated with CPowerDown, PowerUp = 113536 
bytes. 


5.2.5 Schedulability Analysis for Taskset During Mode Changes 


Since some runnables and labels are expected to be located at different cores in 
two different modes, their migration is to be performed during mode changes. 
A runnable migration process is schematically depicted in Figure 5.6. Two 
Mappings ma and mg of runnables Ti, Tj, Tẹ into nodes ma and m, are 
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Ma Me 


T, Tk mA Ti Ty Te 


Ta Tb Ta Th 
Figure 5.6 Example of two different mappings (ma, mg) of runnables Ti, Tj, Tg into cores 
Ta and 7p. 


used in two different system modes: a and 6, respectively. The difference 
between these mappings is the assignment of runnable 7j. We assume no 
deadline violations for both mappings ma and mg. During hyperperiods 
involved in the migration process between a and 6, the schedulability analysis 
for communication resources should take into consideration not only all the 
transfers between ma and m, described in the workload, but also an additional 
periodic job, i.e., the periodic server of certain policy (polling, sporadic, 
deferrable, etc.) with a certain execution time in each period. A technique 
for determining this time is presented in this subsection. 

When the mode changes from a to 5, runnable 7; is to be copied from 
Ta to Tp. Since the precopy strategy is applied, 7, is still executed on core Ta 
during the migration. To migrate runnable 7;, the periodic server is used. The 
whole context of the runnable is transferred during a number of subsequent 
hyperperiods. It is worth stressing that the maximal migration time can be 
computed statically, since the runnable context size and the periodic server 
time slot length and period are known. After this time, it is safe to start 
executing Tj on m, and remove its copy in Ta. 

To guarantee schedulability of runnables, one of the schedulability tests, 
described for example in [42], shall be applied. It is possible to calculate 
the longest possible time interval between the release of runnable 7; and its 
termination, which is referred as 7;’s worst case response time (WCRT) and 
is represented by R;. The schedulability analysis is performed in the way 
described in [9], i.e., by checking whether WCRTs of all runnables do not 
exceed their deadlines. WCRT of runnable 7; can be computed using equation: 


=. j Ri + Rj — C; l 
R=G+ y | 7 le, (5.3) 
VT¡Ehpl(T;) 
where hp(7;) denotes the set of all runnables that can preempt 7;, C; and Cj 
are the worst case execution time of 7; and 7;, respectively, and T} denotes the 
period of 7;. Similarly, the worst case latency rg of packet yx transmitted over 
a link in a mesh NoC with wormhole switching, issued periodically every tz, 
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Figure 5.7 Tasks’ stages in DemoCar: green — runnable execution, red — write to labels; release times and deadlines are provided in ms. 
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can be formulated in a similar manner as that of [61, 100, 122]: 
Th = Ck + bk + lk, (5.4) 


where cz is a basic network latency, bj, is the maximal blocking time 
from lower-priority packages, and l% is the maximal blocking time due to 
interference with higher-priority packets. The basic network latency can be 
computed with the following equation [61, 100]: 


PS 
ce = H: (dr + dr) + [55] - de, (5.5) 
where dp and dz denote the constant packet latencies while traversing one 
router and one link, respectively, PS is the number of bits in the package, 
and F'S is the flit length in bits. H is the hop number between source and 
destination cores. The remaining terms of Equation (5.4) can be computed 
with equations [61, 100]: 


re + (11 — Cc) + Ri 
n= YN E a e 7) 
Vyicinter f (pr) 


where inter f (pp) denotes the direct interference set of yy, which is the set of 
all packets that can preempt px, i.e., have a higher priority and share at least 
one link with yy. The response time of task 7; that releases yz, Ri, has been 
substituted as a maximum release jitter. The term (rı — cı) is an upper bound 
of indirect interference [61]. 

By applying Equations (5.3) and (5.4), both schedulabilities of indepen- 
dent runnables executed on processing cores and packet transmissions can be 
verified. However, jobs in the considered applications, possibly executed on 
different cores, are characterised with various dependency patterns. Typically, 
to start a job execution it is required to have all its parent jobs executed 
(which contributes to so-called computation latency) and all the necessary 
data transferred to the core where this job is assigned to (communication 
latency). 

The goal is thus to establish whether all task-chains of an application have 
their end-to-end deadlines met in a particular platform, and this assessment 
is referred as end-to-end schedulability test. Such test must consider the end- 
to-end latency of each task of a task-chain. To check schedulability of a task 
chain, it is sufficient and necessary to test the individual end-to-end response 
times of all tasks belonging to that chain [73]. In [73], a technique for end- 
to-end schedulability analysis is proposed, but it assumes a pipelined task 
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execution pattern, where multiple jobs of the same task chain are executed 
simultaneously over different cores, but the simultaneous execution of more 
than one job of the same task is not allowed. When the execution pattern 
does not follow this scheme, meeting end-to-end deadlines can be checked 
by assigning an appropriate local deadline for each job in every chain. These 
local deadlines shall be chosen in a way that all the jobs on every core are 
schedulable and the local deadlines at the chain last stage do not exceed the 
respective end-to-end deadline [59]. 


Example 6 (continuation of Example 5) In DemoCar, each task is composed 
with series of three subsequent stages: read from labels, runnable execution, 
write to labels. Since the labels are always located in the same core as that of 
runnable reading these labels, the read stage can be omitted and two remaining 
Stages are presented in Figure 5.7 for all tasks, highlighted in green and 
red. Runnables belonging to one task and drawn one above the other can be 
executed in parallel, whereas the execution order of the runnables follows 
dependencies defined by label write and read operations so that each label is 
to be written by a runnable prior to be read by another runnable. The end- 
to-end deadline has been divided into number of stages in each task and in 
that way deadlines for each stage have been determined (in Figure 5.7 these 
deadlines are written beneath the end point of each stage). 

For example, the release time of runnable ThrottleCtrl is 2ms. By this 
time, all the packets with label values required by this runnable are assumed 
to arrive at the node executing ThrottleCtrl. The deadline for this runnable 
execution is 3 ms, so the WCRT (R;) must not be higher than this value. The 
packages with data are then issued between 2 ms (the runnable release time) 
and Ri. They have to reach their destination nodes earlier than 4 ms. 

To check schedulability of DemoCar, it is then sufficient to check schedu- 
lability for runnables executed in all (green) stages and also data transfers to 
and from labels performed in the appropriate (red) stages. 


Since the earliest execution starting time of each runnable is limited by 
the starting time of the stage including particular runnable, this stage starting 
time can be treated as an offset O; as described in [143]: 


Ri + (R; — C; — O;) — O; 
R=G+ N | ae a i) |; +0. (5.8) 
V7; Ehp(7:) J 
Using similar rationale, Equations (5.4) and (5.7) can be rewritten in the 
following way: 
rk = Ck + bk + lk + Ri, (5.9) 
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O;) 4 Oy 
k= Y la sami Tla +b), (5.10) 
Ver cinter f (pp) 


where term (77 — O;) reflects an additional jitter imposed by the response 
time of task R; that initiates this transfer and O; denotes an offset of the task 
that releases py. One more requirement has to be added to set inter f (pp). 
It includes not only packets having a higher priority and transferred via a 
path sharing at least one link, but also timing boundaries of both their sender 
executions or traffic stages have to overlap. 

As mentioned earlier, the proposed task mapping technique aims to benefit 
from a modal nature of applications, but it also possess new challenges. 
If the modes are treated independently from each other, the end-to-end 
schedulability of runnables and packet transmission in each mode can be 
analysed using Equations (5.8) and (5.9). It is the instant of transition between 
these modes that requires special attention. The task migration time can be 
computed with Equations (5.5), (5.6), (5.9), (5.10), where the packet size, PS, 
is equal to the sum of the header length and the size of the payload including 
the whole context of runnables and labels to be migrated. If a relatively 
large runnable is to be migrated in a highly utilised platform, performing 
the migration when the next job of the runnable is due to start could require 
rather high bandwidth in order not to violate any deadlines. Thus we assume 
to use the precopy strategy, as described in [109]. The job is executed in its 
current (source) location during the mode switching, until all the runnables 
have been migrated to their new (destination) locations. Then the migrated 
runnables are removed from the source location, and their next execution will 
be performed in their destination locations. If a runnable is of combinational 
nature (its outputs depend solely on input values; all DemoCar runnables have 
this property), only the runnable code section is to be migrated. In case of a 
sequential nature of a runnable, the whole context is to be migrated. 

Similarly to [98], we split a runnable context intro two parts: invariant, 
which is not modified at runtime, and dynamic, including all volatile memory 
locations. We assume that an upper bound of the dynamic part size of all 
runnables is known in advance. This part shall be migrated at once using the 
last instance of the periodic server. It means that the local memory locations 
that can be modified by the runnable must not be precopied, but migrated after 
the last execution of the runnable in the old location. This requirement can 
influence the minimum periodic server size and, consequently, the network 
bandwidth, as it must be then wide enough to guarantee migration of dynamic 
part before the next runnable execution (in the new location). This property 
shall be checked using (5.9). 

In the proposed approach, any kind of periodic servers can be used, 
however, the trade-off between implementation complexity and ability to 
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guarantee the deadlines of hard real-time tasks, as described for example in 
[40], shall be considered. 

The number of the hyperperiods required for performing task migration 
depends on the size of runnables and labels to be transferred, mappings, and 
network bandwidth, in particular flit size F'S and timing constants for packet 
latencies while traversing one router and one link dr and dr. 


Example 7 (continuation of Example 6) The flit size, F S, has been fixed to 16 
bits. A few examples of the number of hyperperiods required to migrate tasks 
from PowerUp to Cluster_1, depending on constants da and d | are presented 
in Table 5.1. The hyperperiod length for DemoCar equals 100 ms and this 
time is enough to migrate all data when the router and link latencies are equal 
to 50 and 100 ns, respectively. 


5.2.6 On-Line Steps 


In the proposed approach, only two steps are performed on-line: Detection of 
current mode and Mapping switching. Both of these steps are characterised 
with low computational complexity and thus they impose low overhead for 
the system during run time. 

We assume that the system states are defined explicitly and there is 
a possibility of determining the current state by observing some system 
model variables, similarly to [104]. Otherwise, the most efficient multi-choice 
knapsack problem (MMKP) heuristics, listed in the brief survey earlier, have 
to be applied to identify the current mode on-line, as proposed in [73]. 

When the mode change is requested, an agent residing in each core prepares 
a set of packages with runnables to be migrated via the network. This agent 
is configured statically and is equipped with a table with information which 
runnables have to be migrated during a particular mode change. Then the 
precopy of these runnables is performed. In the following hyperperiods, 
runnables are transported using periodic servers of the length determined 
statically in step Schedulability analysis for taskset during mode change. The 
agent is aware of the number of periodic server instances that have to be used 
during the whole migration process (as in example in Table 5.1), and have the 


Table 5.1 Number of hyperperiods (100 ms) required for switching between states PowerUp 
to Cluster_1 in DemoCar depending on router (dr) and one link latencies (dz) 
da [ns] dz [ns] | No. of Hyperperiods 
5 


0 100 1 
100 200 2 
100 400 3 
200 500 4 
400 800 6 
500 1000 J 
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volatile portion of the context identified. If this instance number elapses, the 
runnables that have been migrated are killed. 

Simultaneously, the same agent can receive migration data from other 
agents in the network. After the appropriate number of hyperperiods, the 
contexts of these runnables are fully migrated and are ready to be executed by 
the operating system. 

The details of the agent depend on the underlying operating system. 


5.3 Related Works 


Systems with distinguishable operating modes are increasingly popular in 
research. A number of research activity aims at developing design-time (off- 
line) heuristics to reduce the number of operating points, since the amount 
of possible scenarios is typically prohibitively high [89]. This Design Space 
Exploration (DSE) process can be carried out using classic heuristic techniques 
(including genetic algorithms [73, 89, 116], tabu search [115], simulated 
annealing [160], particle swarm optimization [103], etc.), or with techniques 
for pruning the design space [107], performing statistical analysis for iden- 
tifying potentially benefiting operating points, or use a priori knowledge of 
the target platform [14]. Then during run-time of that system, a run-time 
manager (RTM) chooses an appropriate operating point according to the 
available platform resources by solving an instance of multi-dimensional 
multi-choice knapsack problem (MMKP). Despite MMKP belongs to the 
NP-hard complexity class, there exists a number of light-weight greedy 
heuristics facilitating finding a quasi-optimal mapping during run-time [73]. 
Alternatively, in [104], there is a possibility of determining the current mode 
out of explicitly given set by observing some variables of the model. In our 
work, the current mode is determined in a similar way. 

Two different mapping approaches are proposed in [119]. In the first 
one, named global static power-aware mapping, each task is assigned to one 
particular processing element independently from the actual scenario. This 
approach reduces the amount of memory required for storing the configura- 
tions and increases the efficiency of run-time management. However, it results 
in increased power consumption in comparison with the second approach, 
dynamic power-aware scenario-mapping, where this assumption is relaxed 
and different mappings for scenarios are stored. These approaches do not 
allow task migration — once a task is assigned to a processing element, it 
remains there until finishing its computation. In contrast, Benini et al. [15] 
allowed tasks to migrate between processing elements when the envisaged 
performance increase is higher than the precomputed migration cost. This 
analysis is performed at each instance of configuration change. 
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In order to analyse the worst case switching time between two modes, it 
is helpful to show the possible modes and transition between them in a formal 
way, using for example Finite State Machines (FSMs), as proposed in [118]. 
In this way it is possible to enumerate all allowed modes and transitions, 
and to check the cost of mode switchings. In [55], an average switching time 
overhead for H.264 decoder has been measured to be equal to 0.2% of the total 
system time. This sligh value has been caused by a low number of existing 
modes, obtained due to the clustering, and thus relatively rare switches. In 
hard real-time systems such decrease of modes by clustering is even more 
crucial and thus it is incorporated in the proposed design flow. In [137], the 
authors suggest to map as many tasks as possible to the same core in various 
modes to avoid the data or code items to be moved between different resources 
when switching between modes. In the proposed approach, we use a genetic 
algorithm to minimize the amount of data to be migrated. 

To perform a schedulability analysis during mode changes, the data 
migration work is performed during time slots allocated to a periodic server. 
There exist different kinds of such servers, including polling servers, sporadic 
servers and deferrable servers, with different replenish policies of server 
execution time [40]. Despite these differences, their period and maximum 
execution time during one period are selected in a way that the chosen end- 
to-end scheduling test proves that no deadline is violated. To decrease the 
timing of best-effort (i.e., migration) task execution, the best-effort bandwidth 
server includes a slack reclaiming procedure and an algorithm for determining 
appropriate server parameters [11]. An example of a direct application of 
synchronized deferrable servers for multi-core systems has been demonstrated 
in [159], but its authors assumed the migration cost to be either negligible, 
or added to the worst-case execution time of each task, which is difficult to 
be applied in systems with network architecture prone to contentions. In the 
proposed approach, more realistic migration time is evaluated, taking into 
account network parameters and interference from other flows. 

In [20], it was experimentally shown that even a total freezing task 
migration strategy, i.e., where the migrated task is stalled while all its code and 
data are transferred through links to the target core, can be used in a NoC-based 
environment and still improve the fulfilment of task deadlines in soft real-time 
systems. To guarantee hard timing constraints, freeze time should be bounded 
and possibly short. One of the possible techniques is a precopy strategy, where 
code and data of some tasks are copied before the actual switching. However, 
this technique is more complicated than total freezing and has higher migration 
time, as some data portions may be required to be copied more than once due 
to their modifications [93]. Storing task code in a few cores and transferring 
only the necessary data is another possibility. However, in doing so, the storage 
overhead at each core can increase by a large amount. 
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A method to guarantee hard real-time for task migration is proposed in 
[98]. However, a costly schedulability analysis is performed during runtime. 
No experiments supporting their proposed approach is provided, but one may 
predict that the overload of that dynamics could be considerable. 

The research described in [104] is the closest to the approach proposed 
in this chapter. Its authors have identified mode transition points in an engine 
management system, and shown that a load distribution by mode-dependent 
task allocation is better balanced in comparison with a static task allocation. 
The performance has been evaluated by simulation, but, contrary to our 
approach, the task migration costs have not been considered. 

From the literature survey it follows that designing real-time systems 
with distinguishable operating modes has been mainly limited to soft timing 
constraints. According to the authors’ knowledge, there is no proposal of any 
method guaranteing no hard deadline violation during task migrations. In 
particular, schedulability analysis has not been applied to check the feasibility 
of task migration process or to determine the worst case switching time 
between two operating modes except of the positional paper [98]. 


5.4 Summary 


An approach of task migration in a multi-core network-based embedded 
system has been proposed as a way to decrease the number of cores needed 
for guaranteing safe execution of a hard real-time software. The steps to be 
performed statically have been described in details using illustrative examples 
based on a lightweight engine control unit. A Finite State Machine describing 
mode changes has been extracted from the software code and transaction 
probabilities have been identified during simulation. The closely related modes 
have been merged into clusters. The most frequent transactions have been 
identified with the classic Prim’s algorithm, and a genetic algorithm has been 
used to determine the runnable-to-core mapping for the initial mode. Similarly, 
a genetic algorithm minimizing the number of migrated data has been used for 
selecting the runnables to be migrated when a change of the current mode is 
requested. The migration time has been evaluated using schedulability analysis 
depending on the network bandwidth. 

The proposed approach requires development of an agent realising the 
migration process. Since its architecture details depend on the underly- 
ing operating system, its implementation and evaluation in real embedded 
environments are planned as a future work. 


Taylor & Francis 
Taylor & Francis Group 
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Swarm Intelligence Algorithms for Dynamic 
Task Reallocation 


The resource allocation mechanisms discussed so far in this book have 
some features in common: they have (some) a priori knowledge about the 
application load they are managing, and they are executed in a centralised 
or hierarchical manner. In this chapter, we explore an approach that does not 
require explicit information about application load, and that is able to make 
decisions about resource allocation in a distributed fashion. To better motivate 
such an approach, let us consider a concrete case study. 

Multimedia applications such as video and audio processing are among the 
most communication and computation-intensive tasks performed by current 
embedded, high-performance and cloud systems. Hardware platforms with 
hundreds of cores are now becoming a preferred target for such applications 
[138], as the computational load in video decoding can be parallelised and 
distributed across the different processing elements on the chip to minimise 
metrics such as overall execution time, power or even temperature. 

An important design constraint in such systems is performance predictabi- 
lity. There is plenty of evidence showing that inconsistent performance in video 
decoding applications can lead to reduced user engagement [46]. However, 
the task of predictably manage multimedia load is not trivial. Video decoding 
execution times vary greatly depending on the spatial and temporal attributes 
of the video [63]. Furthermore, when decoding multiple streams of live video 
(e.g., multipoint video conferencing, multi-camera real-time video surveil- 
lance, multiple view object detection), the workload characteristics became 
increasingly dynamic and thus difficult to model. Thus, efficient resource 
allocation is critical in achieving load balance, power/energy minimisation 
and latency reduction [43]. 

Centralised resource management with a master-slave approach, for 
instance, is a straightforward way to approach this problem, and is probably 
good enough for small systems, but there are many issues that appear as one 
scales up the amount of load to be handled [4, 127]. Cluster based resource 
management techniques have been introduced (e.g., [69]) to overcome the 
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limitations of centralised systems by partitioning the system resources and 
employing multiple cluster managers. Despite those efforts, the complexity 
of dynamic applications and the avaliability of distributed large-scale multi- 
processor systems motivate the investigation of fully-distributed, autonomous 
self-organising/optimising mechanisms [21, 97, 145]. Such systems should 
be able to adapt or optimize itself to changing workload and internal 
conditions and to recover from faults. Many of these systems implement 
self-management features by autonomously controlling and adapting task 
allocation and resource management at runtime. 

As a basis for a distributed resource allocation approach, this chapter 
focuses on the behaviour of biological systems. More specifically, we study 
the swarm intelligence phenomenon, where the individual decisions made 
by the members of a swarm in a distributed manner can result in a global 
behaviour that is beneficial to the whole group. By applying such approach to 
the management of multi-stream video processing load, we hope to show 
its potential and to hint on its applicability to similar kinds of resource 
management problems. 


6.1 System Model and Problem Formulation 
6.1.1 Load Model 


We consider a load model (Figure 6.1) consisting of workflows which resemble 
a container for parallel video stream decoding requests that may arrive at 
arbitrary times, but respecting a specified inter-arrival time. Such load model is 
general enough to describe systems handling a time-varying number of parallel 
video decoding streams. A video stream consists of an arbitrary number of N 
independent jobs. Each job (J;) represents a MPEG group of pictures (GoP), 
and is modelled via a fixed dependency task-graph, and takes the structure 
defined in Figure 6.1. Each node in the task-graph is a MPEG-2 frame-level 
decoder task, and has fixed precedence constraint and communication flow 
shown via the graph edges. A decoder task can only start execution iff its 
predecessor(s) have completed execution and their output data is available. 
A decoder task 7; is characterised by the following tuple: (pi, ti, Xi, ci, ai); 
where p; is the fixed priority, t; is the period, x; is the actual execution time, 
ci is the worst-case computation cost and a; is the arrival time of the decoder 
task 7;. Decoder tasks are preemptive and have a fixed priority. Tasks within a 
job are assigned fixed mapping and priorities at the start of the video stream; 
these exact assignments are used for all tasks of all subsequent jobs in a video 
stream. Tasks of low resolution video streams are given higher priority over 
high-resolution video streams, to ensure low-resolution video streams have a 
lower response-time. 
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Figure 6.1 System overview diagram. 


The spatial resolution of a video stream will correspond directly to the 
computation cost of the task and the payload of the message flows generated 
by those tasks. The exact execution time of the tasks are unknown in advance; 
however, it is assumed that the worst-case computation cost can be estimated. 
Subtask deadlines are unknown but each job is considered schedulable if 
it completes execution within its end-to-end deadline (J? < Deze). The 
response-time of a job (denoted ./;’) is the arrival time of the job to the point in 
time which all of its subtasks have completed execution. A job is considered 
late when (J? — Deze) > 0 and late jobs impact the viewing quality of 
experience (QoE) of the real-time video stream. We assume that the arrival 
rate of jobs are sporadic, and the arrival pattern of new video decoding streams 
are aperiodic. 

Once a task has completed execution, its output (i.e., the decoded frame 
data) is immediately sent as a message flow to the processing element 
executing its successor child tasks, as well as to a buffer in main memory. 
Message flows inherit the priority of their source tasks, with an added offset 
to maintain unique message flow priorities. A message flow, denoted by Msg; is 
characterised by the following tuple: (P;, Ti, PLi, Ci), where P; is the priority, 
T; is the period, PL; is the size of the message payload and C; is the maximum 
no-load latency of message flow Msg;, which can be calculated a priori and 
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usually depends on the topology of the multiprocessor interconnect and on 
the total size of the message (i.e., payload plus headers and other overheads). 


6.1.2 Platform Model 


The multiprocessor platform we target in this chapter is composed of P 
homogeneous processing elements (PEs) connected by a Network-on-Chip 
(NoC) interconnect. Each PE has a local scheduler that handles a task-queue 
which is contained within its local memory. The PEs are directly connected 
to the NoC switches which route data packets towards any destination PE. 
We assume the NoC in our platform model uses fixed priority preemptive 
arbitration, has a 2D mesh topology and uses a deterministic routing algorithm 
such as in [19]. 

In such a platform, the no-load latency C; of a message flow as given in 
Equation (6.1) includes the hop-distance and the number of data units (i.e., 
header and payload flits). 


Ci = (numHops x arbitrationCost) + (numFlits) (6.1) 


We assume that the NoC link arbiters can preempt packets when higher-priority 
packets request the output link they are using. This makes it easier to predict 
the outcome of network contention for specific scenarios. We assume all inter- 
PE communication occurs via the NoC by passing messages. Once a task is 
released from a global input buffer, it is sent to the task queue of the assigned 
PE. The PE upon completing a tasks execution, transmits its output to the 
appropriate PEs dependency buffer. Once a task has completed, the local 
scheduler picks the next task with the highest priority with dependencies 
fulfilled, to be executed next. The resource manager (RM) of the system 
(Figure 6.1), performs initial task mapping and priority assignment and task 
dispatching to the PEs. It also maintains a task-to-PE mapping table of the jobs 
of every admitted and active video stream in the system. The mapping table is 
essentially a hash-table where keys are task-identifiers and values are node- 
identifiers. In this chapter, the terms RM and dispatcher are interchangeable 
as task dispatching is a functionality of the RM. The main responsibility of 
the RM is to make initial mapping decisions for new video streams, and 
to dispatch tasks to the mapped PEs according to the task-to-PE mapping 
table. Most importantly, the system is open-loop as the RM does not gather 
monitoring information from the PEs. 


6.1.3 Problem Statement 


In a centralised closed-loop system, PEs would continuously feedback the 
state of the tasks they were allocated (e.g., their completion time) to a central 
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manager via status message flows. The central manager would then have global 
knowledge of the system in order to make efficient resource management 
decisions for future workloads. As discussed in [4, 127], these advantages 
come at the price of higher communication traffic, congestion hot-spots, 
higher probability of failure, and bottlenecks around the centralised manager. 
Furthermore, such issues are made worse as the NoC size and workload 
increase. 

Cluster-based distributed management approaches can offer a certain 
degree of redundancy and scalability by varying the number of clusters and 
respectively local cluster managers. However, appropriate cluster size selec- 
tion is vital to balance communication-overhead/performance; for example, 
cluster monitoring message flow routes and the cluster manager processing 
overhead will increase as the cluster size increases. Furthermore, the local 
cluster managers are still points of failure in the system, where if one of them 
fails the respective cluster of nodes will severely degrade in performance. 

Fully distributed approaches offer higher levels of redundancy and scal- 
ability over cluster based approaches for large scale systems, due to not 
having any central management nodes. However, due to the lack of global 
knowledge and no monitoring being performed by a centralised authority, 
the system may be load-unbalanced, and cause jobs to miss their deadlines 
and become late. To reduce this job lateness of the admitted dynamic varying 
workload, we follow a bio-inspired distributed task-remapping technique with 
self-organising properties. This technique builds upon existing bio-inspired 
load balancing approaches by Caliskanelli et al. [30] and Mendis et al. [91]. 


6.2 Swarm Intelligence for Resource Management 
6.2.1 PS — Pheromone Signalling Algorithm 


A distributed load-balancing algorithm based on the pheromone signalling 
mechanism used by honey bees has been introduced in [30]. That algorithm, 
henceforth referred to as the PS algorithm, was originally developed to 
improve availability in wireless sensor networks but has features that make it 
attractive as a general load balancing approach. It is based on the concept of 
queen nodes (QN) and regular nodes (WN) in a network, drawing inspiration 
from queen bees and worker bees. The algorithm mimics the process of 
pheromone propagation by queen bees, which by doing so prevent the birth 
of new queens. If a queen dies or leaves the hive, the pheromone levels 
decay and worker bees are then triggered to feed larvae with royal jelly 
and thus differentiate one of them into becoming the new queen. Perhaps 
counter-intuitively, the PS algorithm uses the pheromone signalling process 
to select queen nodes that will be allocated workload (there can be many 
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queens in a system), while worker nodes are not allocated any load unless 
they become queens themselves (which is a completely different behaviour 
from the biological system that inspired the approach). QNs are dynamically 
differentiated from other nodes to indicate they are ready to handle a workload, 
and the aim of the algorithm is to produces sufficient QNs to handle all the 
required system functionality. 

The algorithm is based on the periodic transmission of pheromone by 
QNs, and its retransmission by receipients to their neighbours. The pheromone 
level at each node decays with time and with distance to the source. All 
nodes accumulate pheromone received from QNs, and if at a particular time 
the pheromone level of a node is below a given threshold this node will 
differentiate itself into a QN. This typically happens when this node is too 
far from other QNs. The PS algorithm consists of three phases, which are 
executed asynchronously on every node of the network: two of them are time- 
triggered (differentiation cycle and decay of pheromone) and one of them is 
event-triggered (propagation of received pheromone). 

The first time-triggered phase, referred to as the differentiation cycle 
(Algorithm 6.1), is executed by every node of the network every Ton time 
units. On each execution, the node checks its current pheromone level h; 
against a predefined level Qrp. The node will differentiate itself into QN 
(or maintain its QN status) if h; < Qry, otherwise it will become a WN. If 
the node is a QN, it then transmits pheromone to its network neighbourhood 
to make its presence felt. Each pheromone dose hd is represented as a two- 
position vector. The first element of the vector denotes the distance in hops 
to the QN that has produced it (and therefore is initialised as 0 in line 4 of 
Algorithm 6.1). The second element is the actual dosage of the pheromone 
that will be absorbed by the neighbours. 


Algorithm 6.1 PS Differentiation Cycle 


Input: Differentiation period Ton, local pheromone level h;, local threshold 
Qru, initial pheromone dosage han 

Output: Pheromone dose hd, Queen Node status QN; 

while true do 

if h; < Qry then 

QN; = true; 

broadcast hd = 10, hen}; 


| QN; = false; 
end 
wait for Ton; 


1 
2 
3 
4 
5 else 
6 
7 
8 
9 


end 
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The event-triggered phase of PS deals with the propagation of the 
pheromone released by QNs (as described above in the differentiation cycle) 
and received at neighbouring nodes. The purpose of propagation is to extend 
the influence of QNs to nodes other than their directly connected neighbours. 
Propagation is not a periodic activity, and happens every time a node receives 
a pheromone dose. Its pseudocode appears in Algorithm 6.2. Upon receiving 
a pheromone dose, a node checks whether the QN that has produced it is 
sufficiently near for the pheromone to be effective. It does that by comparing 
the first element of the vector hd with a predefined thresholdpopcount- If the 
hd has travelled more hops than the threshold, the node simply discards it. If 
not, it adds the received dosage of the pheromone to its own pheromone level 
h; and propagates the pheromone to its neighbourhood. Before forwarding 
it, the node updates the hd vector element by incrementing the hop count, 
and by multiplying the dosage by a decay factor 0 < Khopdecay < 1. This 
represents pheromone transmission decaying with distance from the source. 
Figure 6.2 shows four WNs connected to a QN and retransmitting a lower 
dose of pheromone to their neighbours. 

The second time-triggered phase of the algorithm, shown in Algorithm 6.3 
is a simple periodic decay of the local pheromone level of each node. Every 
Taecay time units, h; is multiplied by a decay factor 0 < Krimedecay < 1. It 
can be easily inferred from the PS differentiation cycle that each node makes 
its own decision on whether and when it becomes a QN by referring to local 
information only: its own pheromone level h;. This follows the principles 
of swarm intelligence, where decisions are based on local information and 
interactions within a small neighbourhood. 

The computational complexity of the PS algorithm is very low, as each of 
the phases is a short sequence of simple ALU operations. The communication 
complexity, which in turn determines how often the PS propagation step 


Algorithm 6.2 PS Propagation Cycle 


Input: Propagation threshold thresholdropcoun+, decay factor Kropdecay» 
pheromone dose hd 
Output: Updated pheromone dose hd 
1 if hd received then 


2 if hd[1] < thresholdpopeount then 

4 broadcast hd = {hd[1] + 1, hd[2] x Khopdecay Y; 
5 else 

6 | drop hd; 

7 end 

8 end 
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Figure 6.2 PS pheromone propagation 


Algorithm 6.3 PS Decay Cycle 


Input: Decay period T’pzcay, local pheromone level hj, 
decay factor Krimedecay 
Output: Updated local pheromone level h; 
while true do 
hi = hy x Ktimedecay} 
wait for Tpecay; 
end 


Bw Ne 


is executed, depends on the connectivity of the network and on the Taecay 
parameter. The protocol also provides a stability property, in that a lone node 
with no peers will become and always remain a queen node after a given delay, 
unlike in other distributed resource management approaches where nodes may 
be probabilistically deactivated for some intervals. 


6.2.2 PSRM - Pheromone Signalling Supporting Load 
Remapping 


The PS algorithm described in the previous subsection can be used as a general- 
purpose load balancing approach. Given a set of parameters, it will converge 
to a set of QNs that will then serve the system workload as it arrives. Once 
a QN becomes fully utilized, it can simply decide not to be a QN anymore. 
By stopping pheromone propagation, its neighbours’ pheromone levels will 
reduce over time due to the decay phase of the algorithm, until one or more 
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will have their levels below the threshold and will differentiate themselves into 
QNs. The new QNs will then be ready to handle new workload as it arrives. 

In this subsection, we explore another possibility: using the PS algorithm to 
handle load remapping. In this case, it enables distributed resource allocation 
to optimise a centralised allocation mechanism which may not be fully aware 
of the state of each resource. Such variation of the PS algorithm has been 
applied to the problem of dynamically allocating video streams to a NoC-based 
platform, as described in Section 6.1. 

Algorithm 6.4 shows extensions made to the original PS differentiation 
cycle, aiming to support the remapping functionality. In the original algorithm, 
Qu is fixed as a parameter of the algorithm. In this case, Qr p is dynamically 
adjusted depending on the workload mapped on the resource (namely, the PE 
connected to the NoC). The cumulative slack of the tasks mapped on the PE is 
used to vary the QN threshold Qrp, such that a node will differentiate itself 
into a QN if it has enough slack to accommodate additional tasks (line 4). The 
slack of a task is calculated as the difference between the relative deadline (d;) 
and the observed response-time of the task r;. A negative cumulative slack 
value indicates the PE does not have any spare capacity to take additional 


Algorithm 6.4 PSRM Differentiation Cycle 


Input: Differentiation period Ton, local pheromone level hj, 
local threshold Qu, initial pheromone dosage han 
Output: Pheromone dose hd, Queen Node status QN; 

1 while true do 


/* calc. normalised cumulative TQ slack */ 
(d¿—r;) 
2 TOstack = EPR (di) ; 
VT,EPEMPT 

/* calc. QN threshold */ 
3 if TQsiack > 0 then 
4 Qru = Qru x (1+ (TQstack x QfxH)); 
5 else 
6 Qru = hi x OF: 
7 end 

/* determine queen status */ 
8 if hi < Qru then 
9 QN; = true; 
10 broadcast hd = (0, how, QNay, PEmprinfo}; 
u else 
12 QN, = false; 
13 end 
14 wait for Ton; 


1s end 
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tasks, and hence the node is converted or remains a worker node. Line 2 in 
Algorithm 6.4 shows the calculation of the task queue (TQ) cumulative slack 
(TQ stack) of the mapped tasks. If TQ slack is positive, Qrp is incremented by 
a ratio defined by (TQ stack + QF1,); where (QS y € R| 0 < Qh, < 1) isa 
parameter of the algorithm. If TQ slack is negative, then the algorithm ensures 
the node does not become a QN in this differentiation cycle, by setting Qry 
as a proportion of h; as given in Line 6; here {Qe woe k (Os Qi E 
is also a parameter of the algorithm. The self-organising behaviour of the 
distributed algorithm (specifically the Differentiation cycle in Algorithm 6.4), 
stabilises the number and position of the QNs in the NoC, as time progresses 
and depending on the workload. A node propagates pheromones immediately 
after it becomes a queen (line 10). We represent the pheromone dose (hd) as 
a four position vector containing the distance from the QN, the initial dosage 
(han), the position of the QN in the network (@N,,,) and a data structure 
(PEuprinfo) containing the p; and c; of the tasks mapped on the QN. 
The worker nodes will receive and store this information as the pheromones 
traverse through the network. 

Now that the extension to PS has been introduced, let us focus on its 
integration to a centralised mapping of video stream tasks. We assume the 
centralised mapping is performed according to a lowest worst-case utilisation 
heuristic. Once mapped, atask within ajob may be late due to the PE ornetwork 
route being over-utilised and/or due to the heavy blocking incurred by higher- 
priority tasks and flows. We then aim to change the task-to-PE mapping of late 
tasks, such that these causes of lateness can be mitigated. The task-remapping 
procedure (Algorithm 6.5) is executed by each PE periodically, using only its 
local knowledge gathered via the pheromone doses. 

Algorithm 6.5 illustrates the proposed remapping procedure that utilises 
the adapted PS algorithm, denoted as PS RM. The following steps occur at 
each remapping cycle (seen in Figure 6.3). Firstly the task with the maximum 
lateness T;, AX from the PE task queue, is selected as the task that needs to be 
remapped to a different PE (line 1). The deadline of a task (d;) is calculated as a 
ratio of the end-to-end job deadline (Depe), as given in [65]. Each node is aware 
of the nearest QNs (Q List), and their mapped tasks, by storing the information 
received from each pheromone dose hd. In Lines 3-10, the algorithm evaluates 
the worse-case blocking that will be experienced for the target task Ge AX and 
the number of lower priority tasks that will be blocked, by mapping it onto 
each Qi € Qrist. Once a list of QNs with lower blocking than the current 
blocking is obtained (lines 7-9), they are requested (RQ) for their availability 
(line 11) via a low payload, high priority message flow. The QNs reply (REP) 
with its availability (1.e., if other worker nodes have been remapped to a 
QN in that remapping cycle, then the QNs’ availability is set to false). This 


avoids unnecessary overloading of QNs. Te AX will then be remapped to the 
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Algorithm 6.5 PSRM Remapping 


1 while true do 


/* find most late task from task queue */ 
2 | TMPXL=MAX(([r; € TQ | (ai + di) < te}); 
/* get current blocking for late task */ 
3 B(7MAX-L) = getCurrentBlocking(hp(T M4), 
/* find suitable QNs which offer lower blocking, than 
current blocking */ 
4 QTist = ve 
5 foreach Q; € Qris: do 
/* get target task blocking factor */ 
6 Self Ba = y cj; 
Vr; Ehp(rMAX-L) 
/* get number of lower priority tasks */ 
7 LPsize = WA ; 
8 if Self Bo < B(r4*~) then 
9 | Insert (Q;, LPsize y to QP ict 
10 end 
u end 
/* request for QN availability */ 


12 Avlb_QP,., = requestAvailability(QP,.,); 
/* get available QN that has least amount of lower 


priority tasks */ 
13 {Q NTP, LPO TY} = MIN (Avlb-Qiist); 
/* Update dispatcher task-mapping table */ 
14 Notify dispatcher: 7 M4X-1 — PE(QM!N+?); 
15 wait for Tru; 


16 end 


QN with the least number of lower priority tasks (denoted QM IN-LP) from 
the available QN list (Avlb-QF, st) (line 12). Finally, the task dispatcher is 
notified via message flow to update the task mapping table; the dispatcher 
looks up the task-id in the table and updates the corresponding node-id with 
the new remapped node-id. When the tasks of the next job of the video stream 
arrives into the system they will be dispatched to the node-id indicated by the 
updated mapping table. Therefore, remapping will only take effect from the 
subsequent arrival of the next job in the video stream. Even though there is 
an update message sent to the dispatcher at a remapping event, the remapping 
decision is achieved purely using local information at each PE, based on the 
PSRM algorithm. 

Figure 6.4 illustrates an example of the remapping procedure in a 4 x 4 
NoC. The (x, y) coordinates refer to the processing node in column x and 
row y. In step 1 of Figure 6.4, at each remapping interval (In) each PE 
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Figure 6.3 Sequence diagram of PSRM algorithm related events. Time triggered (periodic): 
PSDifferentiation, PSDecay and Remapping cycles; Event triggered: PSPropagation. 


identifies the late tasks in their task queues; they are also aware of the position 
of any nearby QNs due to the pheromone signals. 7; and 72 on PE(1,0) and 
PE(2,2) are tasks that are late, at that time instant. In step 2, they determine 
the suitability of each QN to remap the late tasks to. 7, can either be remapped 
to Q(1,1) or Q(3,0); and 72 can be remapped on to either Q(3,2) or Q(1,1) but 
Q(3,2) is not suitable due to the task blocking behaviour and Q(0,3) is not in 
the Q Lis: due to distance. In step 2, the nodes request for the suitable QNs’ 
availability; in this instance PE(1,0) obtained a lock on Q(1,1) first. Hence, 
Tı will be remapped onto Q(1,1) and 72 will be remapped to Q(2,3). In step 3 
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Step 1 Step 2 
(Tam: Each PE (Tam : Idenitify and 
identifies query suitable queens 
late tasks (e.g, T,, T3) for availability 


Step 3 Step 4 
(Tam: Remap to suitable @Next job received, 
and available queen. Ta, Ta dispatched to 
Notify dispatcher new PEs 


Figure 6.4 Task remapping example. (Q = queen nodes; D = Dispatcher; [71,72] are late 
tasks; Blue lines represent communication. 


the PEs notify the dispatcher via a message flow regarding the remapping. 
In step 4 the next job arrives and mə and 73 are now dispatched to the new 
processing elements — PE(1,1) and PE(2,3) respectively. 

The performance of adaptive algorithms such as PSRMis highly dependent 
on the selection of a good set of parameters. Manual selection of parameters 
is not feasible due to the size of the search space. Table 6.1 shows several 
important parameters obtained via a search-based parameter selection method 
inspired by [29]. The parameters Ton, Tnecay and Try and their ratios 
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Table 6.1 PSRM algorithm parameters 


Differenciation cycle (Tan) 0.22 
Decay cycle (pecar) 0.055 
Remapping period (IRu) 6.9 
Default QN threshold (QT) 20 
QN threshold inc./dec. factors (Q7 p, Qp) 0.107, 0.01 
Pheromone time and hop decay factors 0.3, 0.15 
Pheromone propagation range 3 


play a key role in obtaining a good performance from the algorithm. The 
experimental results during the parameter search process show that the 
remapping frequency has a significant impact in accuracy and communication 
overhead. The relationship between these parameters have been investigated 
extensively in previous work [29, 30]. As a general guideline, to keep the 
communication overhead low, the event cycles (Ton and Tirar) and the QN 
hormone propagation range must be kept relatively low. 

The platform model used in this case study has fixed priority preemptive 
NoC arbiters and local schedulers. Hence, tasks and flows can be blocked by 
higher priority tasks and flows. The remapping heuristic takes into account 
the new tasks’ blocking incurred by a possible remapping (lines 4-10 of 
Algorithm 6.5). However, since the processing nodes lack a global view of the 
communication flows, the remapping heuristic cannot take into account the 
change in the overall network communication interference pattern caused by 
the reallocation of the tasks. Therefore, there are situations where remapping 
a task can result in an actual lateness increase. As shown in Figure 6.3 
and Algorithm. 6.4, every Ton time units the worker nodes get updates 
from all QNs in close proximity to them. However, between subsequent 
PSDifferentiation events, the workload of the QN can change rapidly when 
the system is heavily utilised, which may lead to inaccurate local knowledge 
regarding the nearby QNs. Furthermore, late tasks should be remapped ideally 
before the next job invocation. However, the remapping event is periodic (1.e., 
every Trm seconds) which allows the remapping overhead to be kept at a 
minimum, but does not guarantee synchronisation with the workload arrival 
pattern. Longer periodic events may lead to inconsistency in data and states, 
but are used to keep the communication overhead at a minimum. 


6.3 Evaluation 
6.3.1 Experiment Design 


To evaluate the resource allocation approach described in this chapter, we 
performed a number of simulations of realistic load patterns allocated over 
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a 100-core Network-on-Chip platform. A discrete-event, abstract simulator 
described in [92] has been adopted. The volume of load was configured such 
that there would be an upper limit of 103 parallel video streams at any given 
time in the simulation. Experiments were performed under 30 unique workload 
situations, where the number of videos per workflow, their resolutions and 
arrival patterns vary based on the randomiser seed used in each simulation run. 
The computation to communication ratio of the workload was approximately 
2:1. The resolution of the video streams were selected at random from a list 
of low to high resolutions (e.g., from 144p to 720p). The inter-arrival time 
of jobs in a video stream were set to be between 1 to 1.5 times the Dese. 
Tasks were initially mapped to the lowest utilised PE (according to worst- 
case utilisation) and priority assignment of the tasks followed a scheme were 
the lowest-resolution tasks get the highest priority. This initial mapping and 
assignment scheme were constant variables for all evaluations. 


6.3.1.1 Metrics 
The experiments have multiple dependent variables as described below: 


e Total number of fully schedulable video streams is the number of all 
admitted video streams that have no late jobs (i.e., J? < Deve). 

e Cumulative job lateness (7 is calculated as the summation of 
lateness of all the late jobs from every video stream (v;) admitted to 
the system (Equation ( 6.2)). In Equation ( 6.2), JE is a late job and VS 
denotes all the video streams admitted to the system. We measure the 
job lateness with remapping enabled/disabled, hence a reduced os. 
when remapping is enabled is considered an improvement to the 
system. This metric gives us a notion of how the remapping technique 
reduced the lateness of the unschedulable video streams, which directly 
affects the QoE of the video stream. 

e Communication overhead is calculated as the sum of the basic latencies 
(Ci) of every control signal in the respective remapping technique. 
In the PSRM algorithm these are the pheromone broadcast and QN 
availability request signals. In the cluster-based technique the PE status 
update traffic and the inter-cluster communication traffic contributes to 
the overhead. Furthermore, the task dispatcher notification messages 
in all the remapping techniques are included in the overhead. Lower 
communication overheads lead to less congested networks as well as 
lower communication energy consumption [39]. 

e Distribution of PE utilisation is calculated by the measured total busy 
time for every PE on the network during a simulation run. PE utilisation 
gives a notion of the workload and a lower variation in workload 
distribution is desirable. Overloading a single resource and/or having 
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a high number of idle PEs, are undesirable properties which may lead to 
reduced reliability and increased wear-and-tear. 


Gps S|. SS Dee) (6.2) 


VuieVS |VWIFEv; 


Comms. overhead = X Ci (6.3) 
Vmsg; € Control Msgs 


6.3.1.2 Baseline Remapping Techniques 
The PSRM resource manager was evaluated against the following baselines: 


e CCPRMva2 — is a cluster-based management proposed in [91] as an 
improvement of the original work by Castilhos et al. [33]. It is configured 
with a cluster size of 2 x 5 (i.e., 10 clusters). 

Centralised management — is essentially CCPRMy2 with only one 

10 x 10 cluster. A single centralised resource manager receives status 

updated from every slave PE in the network and performs periodic 

remapping as described in [91]. The manager notifies the task dispatcher 
of any remapping decisions. 

e A random remapper — is a remapping scheme where, every remapping 
interval each PE selects the most late task in its task queue and randomly 
selects another node on the network to remap to. The task dispatcher is 
notified of the remapping event. 


6.3.2 Experimental Results 


6.3.2.1 Comparison between clustered approaches 

The comparison of CCPRMy, and CCPRM y» for the C eee metric is shown in 
Figure 6.5(a). In this plot a positive improvement indicates that task remapping 
has helped to reduce the cumulative job lateness in the admitted video streams. 
A negative improvement indicates that the remapping has instead worsened the 
lateness of the jobs. Each sample in the distribution corresponds to a simulation 
run with a unique workload. It is clear that the modifications made to the 
original CCPRMy. technique has resulted in an improvement in reducing job 
lateness. In CCPRMy a majority of the data shows negative improvement, 
while CCPRMy2 shows more positive job lateness improvement. However, 
this improvement has costed a 4% increase in communication cost. Certain 
constraints in the local remapping decisions in CCPRMyy2 would result in more 
communication with neighbouring clusters which might explain the increased 
overhead. 
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Figure 6.5 Comparison of CCPRMy1 (original) and CCPRM y 2 (improved). (a) Cumulative 

job lateness improvement. (b) Communication overhead. 


6.3.2.2 Comparison regarding video processing performance 

Figure 6.6 shows the distribution of cumulative job lateness improvement 
for each of the remapping techniques. Firstly, all the techniques show both 
negative and positive improvements; hence, under certain workload situations 
the remapping techniques have failed to improve the lateness of the jobs. 
However, a majority of the distribution in both PSRM and CCPRM yy, are in the 
positive improvementregion. PSRM has a smaller spread in lateness compared 
to the baselines. The upper quartile and a significantly large proportion of the 
inter-quartile range (IQR) falls in the positive improvement area, which is not 
seen in any of the baselines. In over 60% of the workload scenarios PSRM 
will produce positive improvement to the job lateness of the video streams but 
the actual improvement is small (up to 39%-4%). Futhermore, in Figure 6.7, 
we can see PSRM is marginally better than the CCPRMy2 in the number 
of fully schedulable video streams. CCPRMy2 shows a better job lateness 
improvement over the centralised management, because the monitoring traffic 
1s shorter in route-length and hence is less disruptive to the data communica- 
tion. We can see that the centralised management has the highest number of 
schedulable video streams out of the evaluated remappers. This could indicate 
that CCPRMy2 and PSRM gave significant job lateness improvements only 
to a few video streams while the centralised management was able to make 
minor improvements to multiple video streams. The random remapper shows 
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Figure 6.6 Distribution of cumulative job lateness improvement after applying remapping. 
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Figure 6.7 Comparison of fully schedulable video streams for each remapping technique. 


the worst results with a majority of the experiments resulting in negative 
improvements and produces the lowest number of fully schedulable video 
streams. It was interesting to note that there were a few scenarios where 
random remapping produced significant job lateness improvements, which is 
seen by the high upper whisker in the box plot (Figure 6.6). 


6.3.2.3 Comparison regarding communication overhead 

PSRM shows a significant communication overhead reduction when com- 
pared to the baselines (Figure 6.8). The mean and IQR of PSRM commu- 
nication overhead is lower than the baselines but the larger variance of the 
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Figure 6.8 Communication overhead of the remapping approaches. 


results is due to the different range of workloads and their effect on the 
QN differentiation cycle in each experimental run. The maximum overhead 
is comparable to that of CCPRMy2. Both the CCPRMy2 and centralised 
management show a higher and narrower distribution of communication over- 
head than PSRM. A higher upper whisker in PSRM shows that under certain 
workload scenarios the overhead can be costly and similar to the CCPRMy2 
baseline. The lower communication overhead distribution of the centralised 
manager when compared with CCPRMy, is due to the lack of inter-cluster 
communication. In the centralised management scheme communicating tasks 
mapped at the middle of the NoC will suffer due to the network congestion 
caused by the incoming monitoring traffic. Furthermore, these traffic flows 
will occupy longer routes than CCPRM yz. Furthermore, we are shown in [69], 
that the centralised managers’ communication overhead issues become severe 
after the NoC size exceeds 12 x 12. The random mappers’ communication 
overhead is many orders of magnitude lower than the others as it only incurs 
overhead when notifying the task dispatcher regarding remapping decisions. 


6.3.2.4 Comparison regarding processor utilisation 

The PE utilisation distribution shown in Figure 6.9(a), indicates the PEs with 
higher utilisation using lighter shade, while the darker shades show PEs with 
low utilisation levels. The data shown in this plot are normalised such that 
each remapping technique is relative to each other. PSRM shows a slightly 
similar variation in the workload distribution to CCPRMyp> with only a single 
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Figure 6.9 Comparison of PE utilisation for all remapping techniques. (a) Distribution of PE 
utilisation across a 10 x 10 NoC. (b) Histogram of PE busy time (normalised; 20 bins). 
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PE with extremely high utilisation and a few with very low utilisation. The 
curves fitted to the histogram data shown in Figure 6.9(b) indicates that all 
four remapping techniques have a similar spread of workload distribution. 
However, closer examination to the statistical properties of the distributions 
(given in Table 6.2), indicate that centralised management has the lowest 
variance and the mean utilisation. The random remapper has the highest 
variance and mean utilisation. The frequency spikes of the centralised and 
random remappers in Figure 6.9(b) at 0.8, 0.4 and 0.7 probably give rise to 
these statistical properties. PSRM shows a higher mean utilisation and lower 
distribution variance when compared with CCPRMy. 


6.3.3 Outlook 


Overall the results indicate the PSRM technique helps to reduce lateness in 
the video stream jobs and to increase the number of schedulable video streams 
when compared with the CCPRM y2 remapper. It is important to note that this 
improvement, even though is marginal, comes at a much lower communication 
overhead (up to 30% lower than the cluster-based and centralised approaches). 
A higher maximum lateness improvement can be obtained using CCPRMy2, 
but only in 40%-50% of the workload scenarios. Communication overhead 
of CCPRMy2 may grow as the cluster sizes increase, however in the PSRM 
technique this overhead will vary depending on the distribution of QNs in the 
network. Also, unlike in the baseline approaches, in PSRM the pheromone 
signalling message paths are usually short (only a few hops) regardless of 
the NoC size increases. A centralised resource manager can help to evenly 
distribute the workload much better than PSRM, because of its global knowl- 
edge of the PE status and the mapped tasks. However, PSRM shows better 
workload distribution when compared with a cluster-based approach. Unlike 
in the cluster-based management, in PSRM, there are no resource managers 
in the network; each node executes a simple set of rules using only local 
knowledge to collectively improve the performance. The execution cost of 
the remapping event (Algorithm 6.5) is bounded by the number of QNs in 
the local vicinity and the number of tasks mapped on the node. In the cluster 


Table 6.2 PE utilisation distribution statistics. Lower variance (var.) = better workload 
distribution 


mean | var. 
PSRM 0.455 | 0.033 
CCPRMy2 0.454 | 0.034 
Centralised management | 0.447 | 0.032 
Random remapper 0.462 | 0.035 
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based approach each local processor’s execution overhead for management 
functions (such as remapping, inter-core communication, monitoring etc.) 
would increase as the cluster size increases. Unlike in the centralised approach, 
PSRM is decentralised hence has no single point of failure or an isolated 
communication congestion area. Each PE has the capability of performing 
remapping and becoming a QN, hence the level of redundancy in the system 
is greater than in the baseline remappers. 

One of the identified limitations in PSRM is the sensitivity of the param- 
eters. The parameters need to be tuned for a specific network size and can 
produce varying results based on the nature of the workload. Parameters that 
are suitable for a smaller NoC size may not necessarily produce favourable 
results for a larger NoC. The experimental results during the tuning of the 
parameters for the PSRM and CCPRM y, techniques show that the remapping 
frequency has a significant impact on the performance and communication 
overhead. Furthermore, depending on the value of Ton and Tpgcay, PSRM 
will vary in the time it takes to identify suitable QNs in the network, and hence 
better performance results can be seen after longer runs of the algorithm. 


6.4 Summary 


This chapter presented extensions to a fully distributed resource management 
technique based on swarm intelligence. We have shown how such an approach 
can be applied to a multiple video stream decoding application with several 
unknown dynamic workload characteristics, on a NoC-based multicore. With 
low communication overhead, it relies on task-remapping strategies to pro- 
gressively distribute the workload in the network and to reduce the overall job 
lateness. 

The experimental results have shown that the bio-inspired remapper gives 
a marginal (2%—4%) improvement in lateness reduction but incurs 10%-30% 
lower communication overhead and minor improvement to workload distri- 
bution than the baseline cluster-based and centralised management schemes. 
The centralised management allows the system to increase the number of 
total schedulable video streams, but the improvement to the cumulative job 
lateness of the late video streams is poor and the communication overhead 
is higher than in the proposed technique. The benefits of the centralised 
management degrade as the scale of the network and workload increase [4]. 
Results show that the proposed PSRM approach give a marginal benefit in 
reducing the cumulative job lateness of the video streams when compared 
against the CCPRM yz cluster based resource management approach; however 
itis important to note that the improvement is obtained using a significantly less 
(up to 30% lower) communication overhead than the cluster based approach. 
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A reduced communication overhead may lead to lower energy consumption 
[39] and less congested communication network, making PSRM more efficient 
than the cluster-based approach. Furthermore, unlike in the centralised or 
cluster-based approaches the proposed PSRM remapping technique does not 
depend on a single or group of management entities. Each node is independent 
and capable of relocating late tasks to improve the overall job latency, hence 
adopting this technique introduces a high degree of redundancy for NoC-based 
multi/many-cores that require reliable and timely operation. 


Taylor & Francis 
Taylor & Francis Group 


http://taylorandfrancis.com 
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Value-Based Allocation 


In a many-core HPC data centers, jobs arrive at different moments of time 
and they need to be serviced by allocating on the available system cores 
at run-time. In doing so, the value (utility) achieved by servicing the jobs 
should be maximized while trying to minimize the overall energy consumption 
during system operation as mentioned earlier. A job may contain a number of 
dependent/independent tasks or processes to be allocated on the system cores. 
The allocation results for each job determine the value to be achieved and 
also energy consumption, and thus allocation process needs to optimize both 
the metrics (value and energy). Optimizing of energy of large scale HPC 
data centers is of paramount importance as there is a huge concern about the 
energy required to operate such systems [112]. The reports indicate the energy 
consumption of data centers to be between 1.1% and 1.5% of the worldwide 
electricity consumption [70]. Thus, both the value and energy consumption 
need to be optimized during resource allocation process. 

Previous researchers have introduced notion of values (economic or 
otherwise) of the jobs to define their importance level [66]. In overload 
situations where demand for available resources is higher than the supply, such 
a notion facilitates in deciding to hold the low value jobs for late allocation 
and allocating limited resources to the high value jobs. The value of a job can 
change over time to reflect the impact of the computation over the business 
processes, which adds complexity to the allocation process. 

Existing dynamic resource allocation approaches allocate dynamically 
arriving jobs to the platform resources by employing light-weight heuristics 
that can find an allocation quickly. There have also been efforts to utilize 
design-time profiled results to facilitate efficient resource allocation and 
reduce the computations at run-time [125]. These efforts seem promising 
to design job-specific-clouds, where the clients (or customers) and their 
jobs to be submitted for execution are pre-defined, which can be realized 
from the historical data. However, they optimize only for value. Further, 
existing approaches optimizing for both value and energy cannot be applied to 
dependent tasks. Since an HPC job may contain a set of dependent tasks, there 
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is a need to devise resource allocation approaches to be applied on dependent 
tasks while optimizing both value and energy. 


7.1 System Model and Problem Formulation 


Figure 7.1 shows our target system model, which is based on typical industrial 
HPC scenario. The system contains a many-core HPC platform that executes 
a set of jobs submitted by various users at different moments of time. The 
jobs are submitted to the platform resource manager that allocates resources 
to them. This section provides a brief overview of the platform and workload 
model along with the problem formulation. 


7.1.1 Many-Core HPC Platform Model 


The HPC platform H P contains a set of nodes (PG,..., PGy), where each 
node (server) contains a set of homogeneous cores, referred to as processing 
elements (PEs), as shown in the bottom part of Figure 7.1. Similar to a typical 
data center, each node represents a physical server. A node n is represented as a 
set of cores Ch, which communicate via an interconnect. Each core is assumed 
to support DVFS (as briefly described in Chapter 3) and thus its voltage and 


Arrival Time 


Platform Resource Manager 
(Resource Allocation for Arrived Jobs) 


Node PG, eee Node PGy 


Figure 7.1 System model adopted in this chapter. A cloud data center containing different 
nodes (servers) with dedicated cores (PEs) to execute jobs submitted by multiple users. 
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frequency can be independently adjusted in order to achieve a balance between 
energy consumption and job execution time. A platform resource manager 
controls access of platform resources and coordinates the execution of jobs 
submitted by the users, which facilitates efficient management of resources 
and incoming requests. 


7.1.2 Job Model 


Each job j in the HPC workload is modelled as a directed graph TG = (T, E), 
where T is the set of tasks of the job and ÈE is the set of directed edges 
representing dependencies amongst the tasks. Figure 7.2(a) shows an example 
job that contains 7 tasks (t1, . . ., t7) connected by a set of edges. Each task 
t e T is associated with its execution time (ExecTime, measured as worst-case 
execution time (WCET)), when allocated on a core operating at a particular 
voltage level. Such information can be obtained from previous executions of 
the tasks in the job from historical data. Each edge e € E represents data that 
is communicated between the dependent tasks. A job j is also associated with 
its arrival time AT}. 


7.1.3 Value Curve of a Job 


For each job j, the value curve VC; is a function of the value of the job to 
the user depending on the completion time of the job [66]. The value curve 
is usually a monotonically-decreasing function and trends towards zero with 
the increasing completion time, as shown in Figure 7.2(b). We assume a value 
curve is given for each job, as this reflects its business importance as assessed 
by the end user (i.e., domain specific economic model). The description of 
the economic model is orthogonal to our approach and out of scope of this 
chapter. 
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Figure 7.2 An example job model and its value curve. 
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Each job is considered to have a soft deadline [28]. This implies that the 
violation of deadline does not make the computation irrelevant, but reduces 
1ts value for the user [34, 62, 66]. The reduction in value due to delay can be 
determined by observing the value in the value curve atthe delayed completion 
time. Deadlines missed by large margins may result in zero value and thus the 
computation becomes useless for the user. Further, the energy spent on such 
computation can be considered as wasted. Therefore, the job request should 
be rejected if no (zero) value can be obtained by executing it. 


7.1.4 Energy Consumption of a Job 


The total energy consumption (Fota) of a job is computed as the sum of 
dynamic and static energy as follows. 


total = Edynamic + Estatic (7.1) 


The dynamic energy consumption for all the tasks in the job is estimated from 
Equation (7.2). 


Eaynamic = Ss (ExecTime|t] > cy) : (pow — cy)] (7.2) 
vteT 


where ExecTimelt] — cy, and pow — c, are the execution time of task t 
mapped on core c operating at voltage v, and respective power consumption, 
respectively. The ExecT'ime measures are provided in the job model. It is 
assumed that the power consumption at different operating voltages is known 
in advance and taken from chip manufacturer’s data sheet. 

The Estatic for each core is computed as the product of overall execution 
time of the job and static power consumption of the used cores. For p used 
cores, total static energy is computed as p - Estatic, and unused cores are 
considered as power gated so that they do not contribute to the overall energy 
consumption. 


7.1.5 Problem Formulation 


In an HPC system (e.g., Figure 7.1), jobs (j1,..., jm) arriving at different 
moments of time submitted by various users need to be efficiently allocated 
on the resources (cores) of the platform nodes (PG},..., PG y). The resource 
allocation problem targeted in this paper is to jointly optimize value and energy 
while servicing arrived jobs. It is assumed that the tasks of a job are allocated 
to only one node (server) in order to avoid huge communication delay between 
different nodes. To summarize, the targeted problem considers the following 
set of input, constraints and objective. 
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e Input: Workload, i.e., Job set (j1, . . ., jm), Value curve of each job VC;, 
Arrival time of each job AT; (j € 1,...,.M), Cores of the HPC platform 
nodes (PG1,..., PGy), Voltage levels (v1,..., v1) supported by each 
core. 

e Constraints: Limited resources (cores) on each node of HP. 

e Objective: Maximize overall value Valtiota and minimize energy 
consumption Fota. 


For an arrived job, the allocation process followed by the global resource 
manager needs to identify the node to execute the job, tasks to cores allocation 
inside the node, and the voltage/frequency levels of the cores executing tasks of 
the job. We assume negligible time for switching between voltage/frequency 
levels of a core as it is in the order of nanoseconds while tasks execution is in 
the order of minutes or hours [49]. Since there are several possible allocations 
(tasks to cores assignment) for a job and several voltage scaling (VS) options 
for each allocation, exploring the complete design space to identify the optimal 
design in terms of value and energy might not be feasible within acceptable 
time. Therefore, only efficient allocations and appropriate VS options need 
to be evaluated. Further, for dependent tasks, applying VS on a core is rather 
challenging as one needs to capture the VS effect on the execution of dependent 
tasks allocated on other cores. 


7.2 The Solution 


This section describes solutions in order to address the aforementioned 
problem. In order to allocate platform cores to the incoming jobs at run-time, 
the platform resource manager is invoked to find allocations. The manager 
follows profiling or non-profiling based approach, as shown in Figure 7.3. 
The details of these approaches are as follows. 


7.2.1 Profiling Based Approach (PBA) 


This approach uses design-time profiling results of the jobs in the historical 
data to perform run-time resource allocation for the incoming jobs, as shown 


Incoming Jobs Incoming Jobs 
& Value Curves HPC Platform & Value Curves HPC Platform 
Y 
Profiling Run-time Platform Run-time Platform 
Results Resource Manager Resource Manager 
| 
aa Wes -N 
( Allocation Result (Allocation Result ) 


(a) Profiling Based Approach (b) Non-profiling Based Approach 
Figure 7.3 Profiling and non-profiling based approaches. 
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in Figure 7.3(a). For each job, the profiling process identifies the allocation 
and voltage/frequency levels leading to optimized response time (determines 
value) and energy consumption when utilizing different amount of computing 
power in terms of number of cores. The response time is calculated as the 
difference between the end and start time of the job execution after allocating 
resources to it and should be minimized to optimize value. To jointly optimize 
value and energy, we consider to minimize the product of response time 
and energy consumption. At different number of cores, the allocation and 
voltage/frequency levels leading to minimum product value are identified by 
employing a genetic algorithm (GA) based evaluation, similarly as in [117]. 
The number of cores is varied from one to the number of tasks in the job. Such 
variation can exploit all the potential parallelism present in the job as each 
task can occupy only one core. For each job, the allocation, voltage/frequency 
levels, value corresponding to the response time and energy consumption at 
different number of cores are stored as the profiling results. 

To perform resource allocation by using the profiling results, the manager 
follows Algorithm 7.1. The algorithm takes profiling results of the jobs from 
the storage along with their value curves and arrival times, and the HPC 
Platform AP as input and identifies the value and energy optimizing allocation 
for each job based on the number of available cores at different nodes in the 
platform. The algorithm checks mainly for two events as follows: 1) any 
already allocated job(s) finish execution to update the platform resources 
(lines 1-3), and 2) any job(s) arrive into the platform to put into a job queue 
(lines 4-6). If any of the two events or both of them occurs, the algorithm tries 
to perform resource allocation for the queues job(s) having non-zero values 
(lines 7-17). 

To perform resource allocation for all valuable queued jobs (i.e., jobs 
having positive values), all of them (count = 0 to JobQueue.size(), line 8) 
are tried to be allocated on the platform resources as along as any core is 
available. It is ensured that a queued job having zero value at the allocation 
time is dropped from the queue as no value can be made out of it. The 
allocation process continues until all the arrived jobs are allocated or dropped 
due to having zero value while waiting in the job queue. First, bids (in terms 
of number of available cores) from different platform nodes are collected, 
then the maximum bid (max Bid) and the corresponding node is selected 
(line 9). Choosing such a node to use its cores helps to achieve better load 
balancing amongst nodes and thus better resource utilization. In case more 
than one nodes have the same amount of bid, any of them is chosen. If 
the estimate of max Bid is greater than zero (max Bid > 0, line 10), i.e., 
at least one core is available in the platform, the value/energy estimates of 
jobs utilizing maz Bid cores are computed and the job leading to maximum 
value per energy consumption (maxValuePerEnergyJob) is selected to 
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Algorithm 7.1 Profiling Based Resource Allocation 


Input: Incoming Jobs with arrival times, Jobs’ profiling results and value 
curves, HPC Platform HP. 
Output: Resource Allocation for Incoming Jobs. 


1 if allocated_job(s) finish execution then 

2 Update platform resources; 

3 end 

4 if job(s) arrive then 

5 Put the job(s) in JobQueue; 

6 end 

7 if JobQueue contains job(s) having positive values then 
8 for count = 0 to JobQueue.size() do 

9 Collect bids from all nodes and select max Bid; 

10 if max Bid > 0 then 

u Compute value /energy estimates of unscheduled jobs when 
utilizing maz Bid cores; 

12 Select mazValue Per Energy Job and its value, energy, 
allocation, and voltage/ frequency levels from profiling 
results; 

13 Schedule max Value Per Ener gyJob on node having 
maz Bid cores by following the allocation to perform 
execution at voltage / frequency levels; 

14 Update platform resources; 

15 end 

16 end 


17 end 


be scheduled to the node having maz Bid cores by following the allocation 
and voltage/frequency levels leading to the optimized value and energy. The 
computation of value/energy for each job considers its value at the allocation 
time and the exact number of cores to be used by the job computed as minimum 
between maz Bid and the number of cores to be used to achieve maximum 
value/energy. The platform resources are updated after scheduling each job 
to have up to date resources’ availability information for the next allocation 
instance. This helps to achieve an accurate and efficient allocation. Similar 
process is repeated for all the arrived jobs. 

For each job, this approach selects (from the profiling results) allocation 
and voltage/frequency levels leading to maximum value/energy, and thus both 
the value and energy consumption are optimized. 


7.2.2 Non-profiling Based Approach (NBA) 


The NBA approach does not use profiling results as no historical pattern of 
jobs is available to perform advance profiling. Rather, all the computations 
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are performed at run-time. This approach is suitable to the scenarios when the 
jobs to be executed are unknown in advance, i.e., no historical pattern of jobs 
is available. 

The steps followed by the NBA are similar to PBA and sketched in 
Algorithm 7.2. Here, if max Bid is greater than zero (maxBid > 0), the 
following two main steps are employed: i) Compute values of unscheduled 
jobs by finding allocations on mazBid cores (line 6), and ii) Identify 
voltage/frequency levels of used cores to execute allocated tasks to maximize 
value over energy (line 8), which are described subsequently. 

In step i), firstly, an appropriate allocation for each job is identified by 
allocating on max Bid cores. The allocation considers the exact number of 
cores to be used, which is the minimum between maz Bid cores and the 
number of cores equivalent to the number of tasks in the job. The exact 
number of cores could be higher than that of PBA as no profiling information is 
available to identify it exploiting the maximum parallelism. To find an efficient 
allocation, we try to balance load across the used cores. Every task of the job 
is allocated to a core such that the processing load is balanced over the cores. 
In case the number of tasks in the job is higher than the number of cores, the 
approach allocates highly communicating tasks on the same core to reduce the 


Algorithm 7.2 Non-profiling Based Resource Allocation 


Input: Incoming Jobs with arrival times, Value curves of Jobs, 
HPC Platform HP. 
Output: Resource Allocation for Incoming Jobs. 


1 Steps 1 to 6 of Algorithm 7.1; 

2 if JobQueue contains job(s) having positive values then 

3 for count = 0 to JobQueue.size() do 

4 Collect bids from all nodes and select max Bid; 

5 if max Bid > 0 then 

6 Compute values of unscheduled jobs by finding allocations on 
maz Bid cores; 

7 Select max V aluableJob, its allocation and respective 
value; 

8 Identify voltage/frequency levels of used cores in the 
allocation to execute allocated tasks to optimize value and 
energy; 

9 Schedule maxV aluable Job on node having max Bid cores 
by following the allocation to perform execution at found 
voltage/frequency levels; 

10 Update platform resources; 
u end 
12 end 


13 end 


7.2 The Solution 127 


communication overhead. These considerations can lead to minimal response 
time and thus completion time of the job, resulting in maximum value. After 
finding the allocation, the value is computed as the value in the corresponding 
value curve at the completion time by taking the arrival time into account. 
Similarly, value achieved by each job is computed. 

From all the jobs, the one leading to the maximum value, i.e., 
mazxV aluableJob, corresponding allocation and value is selected (line 7). 
Then, voltage/frequency levels are identified in step ii) as described subse- 
quently. 

Step ii) follows Algorithm 7.3, which takes the set of voltage scaling 
(VS) levels V available for cores as input and identifies the VS levels to be 
applied on cores to execute allocated tasks. For each task t, available VS 
levels are applied, and response time and value of the job at its completion 
is computed. From here onwards, applying voltage scaling on a task implies 
applying voltage scaling on the allocated core for the task. Similarly, VS level 
of a task implies VS level of the allocated core to execute the task. The value 
at completion is estimated by looking into the corresponding value curve 
while taking the arrival time of the job into account. If an applied VS on a 
task is valuable (valuejob_completition > 0), then total energy consumption of 
the job is calculated from Equation (7.1). Next, value at per unit of energy 


Algorithm 7.3 Voltage/frequency Identification 
Input: V = {v,|Vi € [1,--- ,n)). 
Output: VS levels of tasks. 

1 repeat 

2 for each task t whose VS level is not fixed do 

3 for each VS level v; do 

4 Apply VS v; on t, and compute response_time 

5 and Value job completiition; 

6 if valuesob completition > 0 then 

7 Calculate total energy consumption Erotal 

8 (by Equation 7.1) when applying v; on t; 

9 ValPerUnitEnerg = VAINE poa, 


Etotal 
10 end 
1 end 
12 end 
13 Find task tf & VS level vy corresponding to maximum 
ValPerUnitEnerg; 


14 Fix voltage of t to vy; 
15 until VS levels of all tasks are not fixed; 
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consumption (Val PerUnitEnerg) is computed. Thereafter, the task and its 
VS level corresponding to maximum ValPerUnitEnerg is found to fix 
the voltage level to execute the task. The same process is repeated to find 
VS levels of other tasks. Once voltage/frequency levels are identified, the 
mazV aluableJob is scheduled on the node having maz Bid cores based on 
the allocation to perform execution at the identified voltage/frequency levels 
(Algorithm 7.2). 


7.3 Evaluations 


The proposed value and energy optimizing resource allocation approaches 
have been implemented in a C++ prototype and integrated with a SystemC 
functional simulator. As a workload, job models from historical data of an 
industrial HPC system at High Performance Computing Center Stuttgart 
(HLRS) are considered. The jobs in the workload have varying arrival time. 
It is considered that higher numbers of jobs arrive in peak times as compared 
to off-peak times. To sufficiently stress the platform, we consider all the jobs 
arriving over a day, i.e., 24-hour period. Each job contains a set of tasks having 
predefined connections (edges) amongst them that determines dependencies. 
For each task, the worst-case execution time (WCET) is known a priori and 
specified in the job model. The number of tasks in the jobs varies from 5 to 10. 
Further, it is assumed that the value curve of each job is given. 

To evaluate our approaches under different load conditions, we conducted 
experiments with varied arrival rates of jobs while keeping higher number 
of arrivals during peak times over off-peak times. We have considered low, 
moderate and high arrival rates, where jobs arrive in the orders of a few 
seconds, dozens of seconds and minutes, respectively. It is assured that the 
total number of jobs for different arrival rates remains the same as the number 
of jobs considered for 24 hours. 

To evaluate our approaches for different number of available servers 
(nodes), varying number of nodes are considered in the HPC platform. Further, 
the number of cores at each node is also varied to evaluate the approaches for 
assorted chip manufacturing technologies, where different number of cores 
can be integrated within a physical chip. The number of cores is varied such 
that it covers a broad spectrum of technologies including advanced servers to 
be available in future. The platform cores are assumed as the cores of Intel 
Core M processor, which supports 6 voltage/frequency levels of operation. 
However, any other type of core and higher number of voltage/frequency 
levels can be considered. 

The main evaluated performance metrics are value and energy consump- 
tion, which are overall value achieved by executing the arrived jobs and energy 
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consumed by the platform cores to execute the jobs, respectively. We also 
evaluate the percentage of rejected jobs that are removed from the job queue 
as their value becomes zero before the resources become available to allocate 
them. The rejected jobs also include jobs achieving zero value after their 
execution, which can be prevented by employing proper admission control 
and schedulability analysis. 


7.3.1 Experimental Baselines 


There are algorithms reported in the literature that apply DVFS to execute jobs. 
However, most of them optimize either only for [141] or energy [124], and 
both value and energy optimizing approaches do not consider jobs containing 
dependent tasks [66]. 

We compare results obtained from our approaches (PBA and NBA) to 
those of [141] and [124]. These approaches are considered for comparison 
as they can be applied to jobs containing dependent tasks and DVFS can 
be applied. In [141], the optimization is performed to optimize only value, 
1.e., no DVES is applied, and the cores are assumed to operate at the highest 
supported voltage level. This approach chooses the maximum value job first 
to optimize the overall value and has been referred to as ValOpt. It helps to 
recognize energy savings by all the approaches applying DVFS. To employ this 
approach, the voltage/frequency identification step (line 8, in Algorithm 7.2) 
has been removed. 

The approach of [124] identifies voltage/frequency levels of cores to 
execute the tasks scheduled on them in order to optimize only energy 
consumption. Therefore, it has been extended to optimize both the value 
and energy for a fair comparison. To employ this approach, the greedy algo- 
rithm of [124] is called for voltage/frequency identification in Algorithm 7.2 
(line 8). In this algorithm, all the tasks scheduled on a core execute on a 
fixed identified voltage/frequency level, referred to as fixing cores power 
states (FCPS), as shown in example Figure 7.4. The voltage/frequency 
identification follows a greedy heuristic, where voltages of cores are fixed 
one by one during consecutive iterations. When employing voltage/frequency 
identification of [124], the approach is referred to as NBA-FCPS. Our approach 
identifies voltage/frequency levels of tasks in the similar manner, where 
tasks scheduled on a core can be executed on different voltages, referred 
to as fixing tasks power states (FTPS), as shown in example Figure 7.4. 
In this case, our NBA approach has been referred to as NBA-FTPS. It 
should also be noted that the run-time computation overhead of NBA 
approach has been considered to capture accurate achieved value after the 
job completion. 
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Figure 7.4 Voltage/frequency identification by FCPS and FTPS. 


7.3.2 Value and Energy Consumption at Different Arrival Rates 


Figure 7.5 shows the overall value and energy consumption when various 
approaches are employed for different arrival rates of jobs. A high arrival rate 
indicates that the jobs arrive quite frequently, whereas less frequently in low 
arrival rate. The value and energy estimates are normalized with respect to 
(w.r.t.) the value and energy by ValOpt approach at high arrival rate. The shown 
results have been computed for 3 nodes, where each node contains 8 cores. 
A couple of observations can be made from the figure. 7) The value obtained 
by all the approaches increases from high to low arrival rates as more jobs are 
processed before their value becomes zero due to late availability of cores. 
2) The value obtained by PBA approach is always higher than that of other 
approaches due to joint optimization effect. On an average, PBA achieves 5.6% 
higher value than that of ValOpt. However, the joint optimization also leads 
to higher energy consumption when jobs arrival rate is not high. For the sake 
of both value and energy optimization, PBA is recommended to be employed. 


Value E NBA-FCPS NBA-FTPS PBA 


Energy —>—NBA-FCPS - +- NBA-FTPS ----0---- PBA 


Value 
(normalized w.r.t. ValOpt at High Rate) 


Energy 
(normalized w.r.t ValOpt at High Rate) 
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Jobs Arrival Rate 
Figure 7.5 Value and energy at different arrival rates. 
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3) The energy consumption by NBA-FCTS and NB-FTPS is close to each 
other and lower than that of ValOpt. On an average, NBA-FCTS and PBA 
reduce energy consumption by 15.8% and 5.8%, respectively, when compared 
to ValOpt. Therefore, for the sake of both value and energy optimization, PBA 
is recommended to be employed. 


7.3.3 Value and Energy Consumption with Varying 
Number of Nodes 


Figure 7.6 shows the influence of the number of nodes (servers) on the overall 
value and energy consumption. At each node, a total of 8 cores are considered. 
The shown results are for high arrival rate of the jobs. The value and energy 
results are normalized w.r.t. the value and energy by ValOpt approach at 2 
nodes. It can be observed that the overall value by all the approaches increases 
with the number of nodes due to increased processing capability leading to 
completion of higher number of jobs before their value becomes zero. It can 
also be observed that PBA achieves higher overall value than other approaches. 
Further, on an average, PBA performs better than other approaches if both the 
value and energy metrics are jointly evaluated as value divided by energy. 


7.3.4 Value and Energy Consumption with Varying 
Number of Cores in Each Node 


Figure 7.7 shows the overall value and energy consumption when number of 
cores at each node are varied for a total of 3 considered nodes. The jobs arriving 
at high rate are considered. The value and energy results are normalized w.r.t. 
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Figure 7.6 Value and energy with varying number of nodes. 
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Figure 7.7 Value and energy with varying number of cores at each node. 


the value and energy by ValOpt approach at 2 cores. A couple of observations 
can be made from the figure. First, the value by all the approaches increases 
with the number of cores due to increased processing capability leading 
to completion of higher number of jobs before their value becomes zero. 
Second, PBA achieves higher overall value than other approaches and better 
results when both the value and energy need to be considered. Third, in case 
profiling of jobs is not possible, i.e., PBA cannot be applied, NBA-FTPS can 
be employed to achieve a better trade-off between value and energy over other 
approaches. 


7.3.5 Percentage of Rejected Jobs 


Table 7.1 shows the rejected jobs (%) at different arrival rates when various 
approaches are employed. The average over different arrival rates is also 
shown for all the approaches. The tabulated results have been computed by 
considering 3 nodes, where each node contains 8 cores. It can be observed 
that, on an average, our proposed approaches NBA-FTPS and PBA reject 
lesser number of jobs as compared to baseline approaches. The PBA has the 
lowest rejection of jobs as each job is allocated on the exact number of cores 


Table 7.1 Percentage of rejected jobs at different arrival rates 
ValOpt NBA-FCPS NBA-FTPS PBA 


High 49.0% 49.4% 48.8% 46.8% 
Medium | 29.2% 30.8% 30.0% 22.0% 
Low 13.0% 12.8% 12.2% 00.0% 


Average | 30.4% 31.0% 30.3% 22.9% 
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exploiting all the potential parallelism with the help of design-time profiled 
results. This result in cores availability for higher number of jobs before their 
value become zero and thus lowers rejections. It should be noted that rejection 
rate by PBA for low arrival rate is not always zero and varies with number of 
cores/nodes. 


7.4 Related Works 


The dynamic resource allocation process usually employs a heuristic following 
some fundamental optimization procedure (e.g., incremental dynamic allo- 
cation) to identify an efficient allocation for each job at run-time. Several 
heuristics have been proposed to accomplish this aim [126]. These heuristics 
optimize one or several performance metrics, e.g., response time and energy 
consumption. In overload situation, these heuristics can lead to starvation, 
missed deadlines, and reduced throughput. Further, these heuristics do not 
take into account any notion of values of jobs to users and thus they do not 
optimize the overall value achieved by executing different jobs. 

Market-inspired resource allocation heuristics are proven to provide 
promising results in the overload situation that is normally encountered in 
HPC system [156]. The heuristics employ notion of values of jobs, where 
values represent importance levels. Some researchers assume fixed value of a 
job [141], whereas others consider values that can change with time, described 
with so-called value curve of the job [25, 66]. In such curve, the value of a job 
normally decreases with computation time and reflects the importance level 
over the business process. 

Market-inspired heuristics allocate jobs in several ways. For example, the 
highest value job is chosen first [141]. This approach might lead to small 
amount of available resources if a high value job requires a large amount 
of resources. To overcome above problem, the job having maximum value 
density can be chosen first [79], where the value density is computed as value 
divided by the amount of required computational resources Another heuristic 
to choose the job having minimum remaining value first is also proposed 
[24]. The remaining value is calculated as the area under the value curve 
from the current time to the time when its value is zero. These heuristics 
try to optimize overall value, but they do not consider energy consumption 
optimization. Further, they do not consider DVFS capable cores, which 
provide opportunities to reduce energy consumption. 

Energy optimization approaches for HPC data centers have focused mainly 
on virtual machines (VMs) consolidation and DVES. In consolidation, VMs 
with low utilization are placed together on a single host so that other used hosts 
can be freed to shut them down [13, 132, 148]. DVFS based approaches have 
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been explored to reduce energy consumption is several areas, e.g., clusters 
[114, 149], web servers [142] and HPC data centers [28]. The approaches 
for HPC data centers (e.g., [28]) do not consider jobs containing dependent 
tasks. For other application domains, DVFS techniques for dependent tasks 
are explored (e.g., [124]), but optimization is not performed for value. 

Some heuristics considering DVFS and optimizing both the value and 
energy consumption are reported in [66]. However, they consider independent 
tasks or jobs containing independent tasks. There are some additional multi- 
criteria optimization approaches, but they perform static resource allocation 
[50, 105]. Further, in dynamic resource allocation process, they do not use 
design-time profiling results, which can provide optimized value and energy. 
In contrast, the reported profiling and non-profiling based dynamic resource 
allocation approaches in this chapter consider jobs containing dependent tasks 
and jointly optimizes for both value and energy while applying DVFS. 


7.5 Summary 


This chapter proposed value and energy optimizing resource allocation 
approaches for HPC data centers. It has been shown that the approaches com- 
bine identification of efficient allocation and appropriate voltage/frequency 
levels to jointly optimize value and energy consumption. Whilst existing 
approaches focus on methods like server consolidation and DVES, they do 
not consider jobs containing dependent tasks. It has been shown that the 
proposed approach is able to significantly reduce energy consumption and 
improve value while applying DVFS for jobs containing dependent tasks. 
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